Posted to solr-user@lucene.apache.org by Mike Hugo <mi...@piragua.com> on 2012/01/24 19:34:22 UTC

HTMLStripCharFilterFactory not working in Solr4?

We recently updated to the latest build of Solr4 and everything is working
really well so far!  There is one case, though, that doesn't behave the way it
did in Solr 3.4: we strip certain HTML constructs (numeric character entities
like trademark and registered, for example) out of a field defined as shown
below.  That worked in Solr 3.4 with the configuration here, but it no longer
works the same way in Solr4.

The label field is defined as type="text_general"
<field name="label" type="text_general" indexed="true" stored="false"
required="false" multiValued="true"/>

Here's the type definition for text_general field:
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"
                        enablePositionIncrements="true"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"
                        enablePositionIncrements="true"/>
                <filter class="solr.LowerCaseFilterFactory"/>
            </analyzer>
        </fieldType>


In Solr 3.4, that configuration completely stripped HTML constructs out of the
indexed field, which is exactly what we wanted.  In Solr4, if we then facet on
the label field, as in the test below, we get some terms in the response that
we would not like to be there.


// test case (groovy)
void specialHtmlConstructsGetStripped() {
    SolrInputDocument inputDocument = new SolrInputDocument()
    inputDocument.addField('label', 'Bose&#174; &#8482;')

    solrServer.add(inputDocument)
    solrServer.commit()

    QueryResponse response = solrServer.query(new SolrQuery('bose'))
    assert 1 == response.results.numFound

    SolrQuery facetQuery = new SolrQuery('bose')
    facetQuery.facet = true
    facetQuery.set(FacetParams.FACET_FIELD, 'label')
    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')

    response = solrServer.query(facetQuery)
    FacetField ff = response.facetFields.find {it.name == 'label'}

    List suggestResponse = []

    for (FacetField.Count facetField in ff?.values) {
        suggestResponse << facetField.name
    }

    assert suggestResponse == ['bose']
}

With the upgrade to Solr4, the assertion fails: the suggested response
contains 174 and 8482 as terms.  Test output is:

Assertion failed:

assert suggestResponse == ['bose']
       |               |
       |               false
       [174, 8482, bose]


I just tried again using the latest build from today, namely:
https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
getting the failing assertion. Is there a different way to configure the
HTMLStripCharFilterFactory in Solr4?

Thanks in advance for any tips!

Mike

RE: HTMLStripCharFilterFactory not working in Solr4?

Posted by Michael Ryan <mr...@moreover.com>.
Try putting the HTMLStripCharFilterFactory before the StandardTokenizerFactory instead of after it. I vaguely recall being burned by something like this before.
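
In schema terms that would be something like the following for the index
analyzer - the same components as the config you posted, just with the
charFilter declared ahead of the tokenizer (the query analyzer would change
the same way):

<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>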

-Michael

Re: HTMLStripCharFilterFactory not working in Solr4?

Posted by Mike Hugo <mi...@piragua.com>.
Thanks, guys!  I'll grab the latest build from the Solr4 Jenkins server once
those commits get picked up and try it out.  Thanks for the quick turnaround!

Mike

RE: HTMLStripCharFilterFactory not working in Solr4?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mike,

Yonik committed a fix to Solr trunk - your test on LUCENE-3721 succeeds for me now.  (On Solr trunk, *all* CharFilters have been non-functional since LUCENE-3396 was committed in r1175297 on 25 Sept 2011, until Yonik's fix today in r1235810; Solr 3.x was not affected - CharFilters have been working there all along.)

Steve

Re: HTMLStripCharFilterFactory not working in Solr4?

Posted by Mike Hugo <mi...@piragua.com>.
Thanks for the responses everyone.

Steve, the test method you provided also works for me.  However, when I try a
more end-to-end test with the HTMLStripCharFilterFactory configured for a
field, I am still having the same problem.  I attached a failing unit test and
configuration to the following issue in JIRA:

https://issues.apache.org/jira/browse/LUCENE-3721

I appreciate all the prompt responses!  Looking forward to finding the root
cause of this guy :)  If there's something I'm doing incorrectly in the
configuration, please let me know!

Mike

RE: HTMLStripCharFilterFactory not working in Solr4?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mike,

When I add the following test to TestHTMLStripCharFilterFactory.java on Solr trunk, it passes:
  
public void testNumericCharacterEntities() throws Exception {
  final String text = "Bose&#174; &#8482;";  // |Bose® ™|
  HTMLStripCharFilterFactory htmlStripFactory = new HTMLStripCharFilterFactory();
  htmlStripFactory.init(Collections.<String,String>emptyMap());
  CharStream charStream = htmlStripFactory.create(CharReader.get(new StringReader(text)));
  StandardTokenizerFactory stdTokFactory = new StandardTokenizerFactory();
  stdTokFactory.init(DEFAULT_VERSION_PARAM);
  Tokenizer stream = stdTokFactory.create(charStream);
  assertTokenStreamContents(stream, new String[] { "Bose" });
}

What's happening: 

First, htmlStripFactory converts "&#174;" to "®" and "&#8482;" to "™".  Then stdTokFactory declines to tokenize "®" and "™", because they belong to the Unicode general category "Symbol, Other" and so are not included in any of the output tokens.

StandardTokenizer uses the Word Break rules from UAX#29 <http://unicode.org/reports/tr29/> to find token boundaries, and then outputs only alphanumeric tokens.  See the JFlex grammar for details: <http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>.
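
A quick way to see that category assignment for yourself from plain Java - just
an illustration, not part of the test above:

public class CategoryCheck {
  public static void main(String[] args) {
    // Both characters are in Unicode general category "Symbol, Other" (So),
    // which StandardTokenizer's UAX#29-based grammar never emits as part of a token.
    System.out.println(Character.getType('\u00AE') == Character.OTHER_SYMBOL); // true - REGISTERED SIGN
    System.out.println(Character.getType('\u2122') == Character.OTHER_SYMBOL); // true - TRADE MARK SIGN
  }
}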

The behavior you're seeing is not consistent with the above test.

Steve

Re: HTMLStripCharFilterFactory not working in Solr4?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
Oops, I didn't read carefully enough to see that you wanted those constructs
entirely stripped out.

Given that you're seeing numbers indexed, this strongly indicates an
escaping bug in the SolrJ client that must have been introduced at
some point.
I'll see if I can reproduce it in a unit test.


-Yonik
http://www.lucidimagination.com

Re: HTMLStripCharFilterFactory not working in Solr4?

Posted by Mike Hugo <mi...@piragua.com>.
Thanks for the response, Yonik.
Interestingly enough, changing to the LegacyHTMLStripCharFilterFactory does
NOT solve the problem - in fact, I get the same result.

I can see that the LegacyHTMLStripCharFilterFactory is being applied at
startup:

Jan 24, 2012 1:25:29 PM org.apache.solr.util.plugin.AbstractPluginLoader load
INFO: created : org.apache.solr.analysis.LegacyHTMLStripCharFilterFactory

However, I'm still getting the same assertion error.  Any thoughts?

Mike


Re: HTMLStripCharFilterFactory not working in Solr4?

Posted by Yonik Seeley <yo...@lucidimagination.com>.
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
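
In the field type from the original post that's just a class-name swap on the
existing charFilter line in each analyzer, e.g.:

<charFilter class="solr.LegacyHTMLStripCharFilterFactory"/>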

-Yonik
http://www.lucidimagination.com


