Posted to solr-user@lucene.apache.org by Merlin Morgenstern <me...@googlemail.com> on 2011/07/25 13:09:14 UTC

strip html from data

Hi there,

I am trying to strip HTML tags from the data before adding the documents to
the index. To do that I altered schema.xml like this:

    <fieldType name="text" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="0"
                    catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
    </fieldType>

    <fields>
        <field name="text" type="text" indexed="true" stored="true"
               required="false"/>
    </fields>

Unfortunately this does not work: HTML tags like <h3> are still present
after restarting and reindexing. I also tried HTMLStripTransformer, but that
did not work either.

Does anybody have an idea how to get this done? Thank you in advance for any hint.

Merlin

Re: strip html from data

Posted by Merlin Morgenstern <me...@googlemail.com>.

2011/8/11 Ahmet Arslan <io...@yahoo.com>

> > Is there a way to strip the html tags completely and not index them?
> > If not, how do I retrieve the results without html tags?
>
> How do you push documents to solr? You need to strip html tags before the
> analysis chain. For example, if you are using Data Import Handler, you can
> use HTMLStripTransformer.
>
>  http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
>

Thank you everybody for your help and all the detailed explanations. This
solution fixed the problem.
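
For anyone finding this thread later, here is a minimal sketch of what such a
Data Import Handler setup might look like. The transformer="HTMLStripTransformer"
and stripHTML="true" attributes are the ones described on the wiki page linked
above; the data source, driver, table and column names are placeholders I made
up for illustration, not my actual configuration.

```xml
<!-- Sketch of a DIH data-config.xml. The JDBC driver, URL, credentials,
     table name and column names are hypothetical placeholders. -->
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="example" password="example"/>
  <document>
    <!-- The transformer must be declared on the entity... -->
    <entity name="item" transformer="HTMLStripTransformer"
            query="SELECT id, text FROM items">
      <field column="id" name="id"/>
      <!-- ...and enabled per column. The markup is removed before the value
           reaches Solr, so the stored value is tag-free as well. -->
      <field column="text" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Because the stripping happens before the document is handed to the indexing
process, it affects the stored value, which an analyzer-level charFilter
cannot do.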

Best regards.

Re: strip html from data

Posted by Erick Erickson <er...@gmail.com>.

Right, this is expected behavior; it trips a lot of people up.

When you specify ' indexed="true" ' in your field definitions, the
contents of the input stream are put into the inverted index etc. *after*
all the transformations you specify via tokenizers, filters, charFilters,
etc. are applied. In this case, if you specify something like
"HTMLStripCharFilterFactory", all the HTML markup is removed from the
terms that are put in the inverted index. Those are the values you see
when using the terms component. If you stem, for instance, you'll see
values like "stori" for "story" or some such. So let's take some specific
input: <b>story board</b>. The only things indexed (assuming you stripped
the HTML and whitespace-tokenized) would be "story" and "board". <b> and
</b> would not show up in TermsComponent, since that is looking at tokens
in the inverted index.

This has absolutely nothing to do with setting ' stored="true" '. What happens
there is that the raw input stream is just put, verbatim, in a different file
than the inverted index. So <b>story board</b> is put there.

When you specify &fl=<field>, the *stored* value is returned, and you'd see
<b>story board</b>.

Why, you ask, is it done this way? Well, the idea of stored data is you want to
see the original. Imagine returning something to the user with casing changes,
stemming changes, stop words removed, etc. Gibberish. Thus the bifurcation.
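
To make the split concrete, here is a rough sketch of a schema fragment. The
type and field names ("text_html", "body") are invented for this example, and
the filter chain is deliberately shorter than the one posted earlier in the
thread; treat it as an illustration, not a recommended configuration.

```xml
<!-- Sketch only: "text_html" and "body" are hypothetical names. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Applied on the way into the inverted index: markup is dropped here,
         so no tag ever becomes an indexed term. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- indexed="true": the analyzed terms ("story", "board") are searchable.
     stored="true":  the untouched input (<b>story board</b>) is kept
                     separately and is what fl=body returns. -->
<field name="body" type="text_html" indexed="true" stored="true"/>
```

With a definition along these lines, TermsComponent and the schema browser
would show only the analyzed terms, while the document returned by a query
would still carry the original markup.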

Hope this helps
Erick

On Thu, Aug 11, 2011 at 4:19 AM, Merlin Morgenstern
<me...@googlemail.com> wrote:
> I am sorry, but I do not really understand the difference between the
> indexed and the returned result set.
>
> I look at the "returned" dataset via this command:
> solr/select/?q=id:533563&terms=true
>
> which gives me html tags like these: </b><br />
>
> I also tried to turn on TermsComponent, but it did not change anything:
> solr/select/?q=id:533563&terms=true
>
> The schema browser does not show any html tags inside the text field, just
> indexed words of the one dataset.
>
> Is there a way to strip the html tags completely and not index them? If not,
> how do I retrieve the results without html tags?
>
> Thank you for your help.
>
>
>
> 2011/8/9 Erick Erickson <er...@gmail.com>
>
>> OK, what does "not working" mean? You never answered Markus' question:
>>
>> "Are you looking at the returned result set or what you've actually
>> indexed? Analyzers are not run on the stored data, only on indexed data."
>>
>> If "not working" means that your returned results contain the markup, then
>> you're confusing indexing and storing. All the analysis chains operate
>> on data sent into the indexing process. But the verbatim data is *stored*
>> prior to (or separate from) indexing.
>>
>> So my assumption is that you see data returned in the document with
>> markup, which is just as it should be, and there's no problem at all. And
>> your actual indexed terms (try looking at the data with TermsComponent, or
>> admin/schema browser) will NOT have any markup.
>>
>> Perhaps you can back up a bit and describe what's failing vs. what you
>> expect.
>>
>> Best
>> Erick
>>
>> On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
>> <me...@googlemail.com> wrote:
>> > Unfortunately I still can't get it running. The code I am using is the
>> > following:
>> >                <analyzer type="index">
>> >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >                    <filter class="solr.WordDelimiterFilterFactory"
>> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> >                    <filter class="solr.LowerCaseFilterFactory"/>
>> >                    <filter class="solr.KeywordMarkerFilterFactory"/>
>> >                    <filter class="solr.PorterStemFilterFactory"/>
>> >                </analyzer>
>> >                <analyzer type="query">
>> >                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >                    <filter class="solr.WordDelimiterFilterFactory"
>> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>> >                    <filter class="solr.LowerCaseFilterFactory"/>
>> >                    <filter class="solr.KeywordMarkerFilterFactory"/>
>> >                    <filter class="solr.PorterStemFilterFactory"/>
>> >                </analyzer>
>> >
>> > I also tried this one:
>> >
>> >    <types>
>> >         <fieldType name="text" class="solr.TextField"
>> > positionIncrementGap="100" autoGeneratePhraseQueries="true">
>> >               <analyzer>
>> >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >                  <tokenizer class="solr.StandardTokenizerFactory"/>
>> >                  <filter class="solr.StandardFilterFactory"/>
>> >            </analyzer>
>> >         </fieldType>
>> >    </types>
>> >      <field name="text" type="text" indexed="true" stored="true"
>> > required="false"/>
>> >
>> > None of those worked. I restarted Solr after the schema update and
>> > reindexed the data. No change, the html tags are still in there.
>> >
>> > Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on
>> > SUSE Linux.
>> >
>> > Thank you for any help on this.
>> >
>> >
>> >
>> > 2011/7/25 Mike Sokolov <so...@ifactory.com>
>> >
>> >> Hmm that looks like it's working fine.  I stand corrected.
>> >>
>> >>
>> >>
>> >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
>> >>
>> >>> I've seen that issue too and read comments on the list, yet I've never
>> >>> had trouble with the order; I don't know what's going on. Check this
>> >>> analyzer, I've moved the charFilter to the bottom:
>> >>>
>> >>> <analyzer type="index">
>> >>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> >>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> >>> catenateAll="0" splitOnCaseChange="1"/>
>> >>> <filter class="solr.LowerCaseFilterFactory"/>
>> >>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> >>> ignoreCase="false" expand="true"/>
>> >>> <filter class="solr.StopFilterFactory" ignoreCase="false"
>> >>> words="stopwords.txt"/>
>> >>> <filter class="solr.ASCIIFoldingFilterFactory"/>
>> >>> <filter class="solr.SnowballPorterFilterFactory"
>> >>> protected="protwords.txt" language="Dutch"/>
>> >>> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>> <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >>> </analyzer>
>> >>>
>> >>> The analysis chain still does its job as i expect for the input:
>> >>> <span>bla bla</span>
>> >>>
>> >>> Index Analyzer
>> >>> org.apache.solr.analysis.HTMLStripCharFilterFactory
>> >>> {luceneMatchVersion=LUCENE_34}
>> >>> text    bla bla
>> >>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> {luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
>> >>> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
>> >>> catenateNumbers=1}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> type    word    word
>> >>> org.apache.solr.analysis.LowerCaseFilterFactory
>> >>> {luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> type    word    word
>> >>> org.apache.solr.analysis.SynonymFilterFactory
>> >>> {synonyms=synonyms.txt, expand=true, ignoreCase=false,
>> >>> luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> type    word    word
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> org.apache.solr.analysis.StopFilterFactory
>> >>> {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> type    word    word
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
>> >>> {luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> type    word    word
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
>> >>> {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> keyword         false   false
>> >>> type    word    word
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>> {luceneMatchVersion=LUCENE_34}
>> >>> position        1       2
>> >>> term text       bla     bla
>> >>> keyword         false   false
>> >>> type    word    word
>> >>> startOffset     6       10
>> >>> endOffset       9       13
>> >>>
>> >>>
>> >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
>> >>>
>> >>>
>> >>>> Hmm - I'm not sure about that; see
>> >>>> https://issues.apache.org/jira/browse/SOLR-2119
>> >>>>
>> >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
>> >>>>
>> >>>>
>> >>>>> charFilters are executed first regardless of their position in the
>> >>>>> analyzer.
>> >>>>>
>> >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
>> >>>>>
>> >>>>>
>> >>>>>> I think you need to list the charFilter earlier in the analysis
>> >>>>>> chain, before the tokenizer. Probably Solr should tell you this...
>> >>>>>>
>> >>>>>> -Mike
>> >>>>>>
>> >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>> Sounds logical. I just changed it to the following, restarted and
>> >>>>>>> reindexed with commit:
>> >>>>>>>
>> >>>>>>>     <fieldType name="text" class="solr.TextField"
>> >>>>>>>                positionIncrementGap="100" autoGeneratePhraseQueries="true">
>> >>>>>>>         <analyzer type="index">
>> >>>>>>>             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>>>>>>             <filter class="solr.WordDelimiterFilterFactory"
>> >>>>>>>                     generateWordParts="1" generateNumberParts="1" catenateWords="1"
>> >>>>>>>                     catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>> >>>>>>>             <filter class="solr.LowerCaseFilterFactory"/>
>> >>>>>>>             <filter class="solr.KeywordMarkerFilterFactory"/>
>> >>>>>>>             <filter class="solr.PorterStemFilterFactory"/>
>> >>>>>>>             <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >>>>>>>         </analyzer>
>> >>>>>>>         <analyzer type="query">
>> >>>>>>>             <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>>>>>>             <filter class="solr.WordDelimiterFilterFactory"
>> >>>>>>>                     generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> >>>>>>>                     catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>> >>>>>>>             <filter class="solr.LowerCaseFilterFactory"/>
>> >>>>>>>             <filter class="solr.KeywordMarkerFilterFactory"/>
>> >>>>>>>             <filter class="solr.PorterStemFilterFactory"/>
>> >>>>>>>             <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> >>>>>>>         </analyzer>
>> >>>>>>>     </fieldType>
>> >>>>>>>
>> >>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
>> >>>>>>> inside the data, although I believe there are fewer than before, but I
>> >>>>>>> cannot prove that. Fact is, there are still html tags inside the
>> >>>>>>> data.
>> >>>>>>>
>> >>>>>>> Any other ideas what the problem could be?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> 2011/7/25 Markus Jelsma <markus.jelsma@openindex.io>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> You've three analyzer elements, I wonder what that would do. You
>> >>>>>>>> need to add the char filter to the index-time analyzer.
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Markus Jelsma - CTO - Openindex
>> >>>>>>>> http://www.linkedin.com/in/markus17
>> >>>>>>>> 050-8536620 / 06-50258350
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>
>> >>
>> >
>>
>

Re: strip html from data

Posted by Alexei Martchenko <al...@superdownloads.com.br>.

You can use <charFilter class="solr.HTMLStripCharFilterFactory"/> as in the
example below. Check the docs for your specific Solr version, because
something changed in the HTML-strip syntax in 1.4 and 3.x.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

2011/8/11 Merlin Morgenstern <me...@googlemail.com>

> I am sorry, but I do not really understand the difference of indexed and
> returned result set.
>
> I look on the "returned" dataset via this command:
> solr/select/?q=id:533563&terms=true
>
> which gives me html tags like this ones: </b><br />
>
> I also tried to turn on TermsComponent, but it did not change anything:
> solr/select/?q=id:533563&terms=true
>
> The shema browser does not show any html tags inside the text field, just
> indexed words of the one dataset.
>
> Is there a way to strip the html tags completly and not index them? If not,
> how to I retrieve the results without html tags?
>
> Thank you for your help.
>
>
>
> 2011/8/9 Erick Erickson <er...@gmail.com>
>
> > OK, what does "not working" mean? You never answered Markus' question:
> >
> > "Are you looking at the returned result set or what you've actually
> > indexed?
> > Analyzers are not run on the stored data, only on indexed data."
> >
> > If "not working" means that your returned results contain the markup,
> then
> > you're confusing indexing and storing. All the analysis chains operate
> > on data sent into the indexing process. But the verbatim data is *stored*
> > prior to (or separate from) indexing.
> >
> > So my assumption is that you see data returned in the document with
> > markup, which is just as it should be, and there's no problem at all. And
> > your
> > actual indexed terms (try looking at the data with TermsComponent, or
> > admin/schema browser) will NOT have any markup.
> >
> > Perhaps you can back up a bit and describe what's failing .vs. what you
> > expect.
> >
> > Best
> > Erick
> >
> > On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
> > <me...@googlemail.com> wrote:
> > > Unfortunatelly I still cant get it running. The code I am using is the
> > > following:
> > >                <analyzer type="index">
> > >                    <charFilter
> class="solr.HTMLStripCharFilterFactory"/>
> > >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >                    <filter class="solr.WordDelimiterFilterFactory"
> > > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >                    <filter class="solr.LowerCaseFilterFactory"/>
> > >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> > >                    <filter class="solr.PorterStemFilterFactory"/>
> > >                </analyzer>
> > >                <analyzer type="query">
> > >                    <charFilter
> class="solr.HTMLStripCharFilterFactory"/>
> > >                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >                    <filter class="solr.WordDelimiterFilterFactory"
> > > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >                    <filter class="solr.LowerCaseFilterFactory"/>
> > >                    <filter class="solr.KeywordMarkerFilterFactory"/>
> > >                    <filter class="solr.PorterStemFilterFactory"/>
> > >                </analyzer>
> > >
> > > I also tried this one:
> > >
> > >    <types>
> > >         <fieldType name="text" class="solr.TextField"
> > > positionIncrementGap="100" autoGeneratePhraseQueries="true">
> > >               <analyzer>
> > >                <charFilter class="solr.HTMLStripCharFilterFactory"/>
> > >                  <tokenizer class="solr.StandardTokenizerFactory"/>
> > >                  <filter class="solr.StandardFilterFactory"/>
> > >            </analyzer>
> > >         </fieldType>
> > >    </types>
> > >      <field name="text" type="text" indexed="true" stored="true"
> > > required="false"/>
> > >
> > > none of those worked. I restartred solr after the shema update and
> > reindexed
> > > the data. No change, the html tags are still in there.
> > >
> > > Any other ideas? Maybe this is a bug in solr? I am using solr 3.3.0 on
> > suse
> > > linux.
> > >
> > > Thank you for any help on this.
> > >
> > >
> > >
> > > 2011/7/25 Mike Sokolov <so...@ifactory.com>
> > >
> > >> Hmm that looks like it's working fine.  I stand corrected.
> > >>
> > >>
> > >>
> > >> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> > >>
> > >>> I've seen that issue too and read comments on the list yet i've never
> > had
> > >>> trouble with the order, don't know what's going on. Check this
> > analyzer,
> > >>> i've
> > >>> moved the charFilter to the bottom:
> > >>>
> > >>> <analyzer type="index">
> > >>> <tokenizer class="solr.**WhitespaceTokenizerFactory"/>
> > >>> <filter class="solr.**WordDelimiterFilterFactory"
> generateWordParts="1"
> > >>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > >>> catenateAll="0"
> > >>> splitOnCaseChange="1"/>
> > >>> <filter class="solr.**LowerCaseFilterFactory"/>
> > >>> <filter class="solr.**SynonymFilterFactory" synonyms="synonyms.txt"
> > >>> ignoreCase="false" expand="true"/>
> > >>> <filter class="solr.StopFilterFactory" ignoreCase="false"
> > >>> words="stopwords.txt"/>
> > >>> <filter class="solr.**ASCIIFoldingFilterFactory"/>
> > >>> <filter class="solr.**SnowballPorterFilterFactory"
> > >>> protected="protwords.txt"
> > >>> language="Dutch"/>
> > >>> <filter class="solr.**RemoveDuplicatesTokenFilterFac**tory"/>
> > >>> <charFilter class="solr.**HTMLStripCharFilterFactory"/>
> > >>> </analyzer>
> > >>>
> > >>> The analysis chain still does its job as i expect for the input:
> > >>> <span>bla bla</span>
> > >>>
> > >>> Index Analyzer
> > >>> org.apache.solr.analysis.**HTMLStripCharFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> text    bla bla
> > >>> org.apache.solr.analysis.**WhitespaceTokenizerFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.**WordDelimiterFilterFactory
> > >>> {splitOnCaseChange=1,
> > >>> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
> > >>> generateWordParts=1, catenateAll=0, catenateNumbers=1}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> type    word    word
> > >>> org.apache.solr.analysis.**LowerCaseFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> type    word    word
> > >>> org.apache.solr.analysis.**SynonymFilterFactory
> {synonyms=synonyms.txt,
> > >>> expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.**StopFilterFactory {words=stopwords.txt,
> > >>> ignoreCase=false, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.**ASCIIFoldingFilterFactory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.**SnowballPorterFilterFactory
> > >>> {protected=protwords.txt,
> > >>> language=Dutch, luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> keyword         false   false
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>> org.apache.solr.analysis.**RemoveDuplicatesTokenFilterFac**tory
> > >>> {luceneMatchVersion=LUCENE_34}
> > >>> position        1       2
> > >>> term text       bla     bla
> > >>> keyword         false   false
> > >>> type    word    word
> > >>> startOffset     6       10
> > >>> endOffset       9       13
> > >>>
> > >>>
> > >>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> > >>>
> > >>>
> > >>>> Hmm - I'm not sure about that; see
> > >>>> https://issues.apache.org/**jira/browse/SOLR-2119<
> > https://issues.apache.org/jira/browse/SOLR-2119>
> > >>>>
> > >>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> > >>>>
> > >>>>
> > >>>>> charFilters are executed first regardless of their position in the
> > >>>>> analyzer.
> > >>>>>
> > >>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> > >>>>>
> > >>>>>
> > >>>>>> I think you need to list the charfilter earlier in the analysis
> > chain;
> > >>>>>> before the tokenizer.  Porbably Solr should tell you this...
> > >>>>>>
> > >>>>>> -Mike
> > >>>>>>
> > >>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>> sounds logical. I just changed it to the following, restarted and
> > >>>>>>> reindexed
> > >>>>>>>
> > >>>>>>> with commit:
> > >>>>>>>            <fieldType name="text" class="solr.TextField"
> > >>>>>>>
> > >>>>>>> positionIncrementGap="100" autoGeneratePhraseQueries="**true">
> > >>>>>>>
> > >>>>>>>                   <analyzer type="index">
> > >>>>>>>
> > >>>>>>>                       <tokenizer
> > >>>>>>>                       class="solr.**WhitespaceTokenizerFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> WordDelimiterFilterFactory"
> > >>>>>>>
> > >>>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >>>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>
> > >>>>>>>                       <filter
> > class="solr.**LowerCaseFilterFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> KeywordMarkerFilterFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> PorterStemFilterFactory"/>
> > >>>>>>>                       <charFilter
> > >>>>>>>                       class="solr.**HTMLStripCharFilterFactory"/>
> > >>>>>>>
> > >>>>>>>                   </analyzer>
> > >>>>>>>                   <analyzer type="query">
> > >>>>>>>
> > >>>>>>>                       <tokenizer
> > >>>>>>>                       class="solr.**WhitespaceTokenizerFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> WordDelimiterFilterFactory"
> > >>>>>>>
> > >>>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > >>>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > >>>>>>>
> > >>>>>>>                       <filter
> > class="solr.**LowerCaseFilterFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> KeywordMarkerFilterFactory"/>
> > >>>>>>>                       <filter class="solr.**
> > >>>>>>> PorterStemFilterFactory"/>
> > >>>>>>>                       <charFilter
> > >>>>>>>                       class="solr.**HTMLStripCharFilterFactory"/>
> > >>>>>>>
> > >>>>>>>                   </analyzer>
> > >>>>>>>
> > >>>>>>>            </fieldType>
> > >>>>>>>
> > >>>>>>> Unfortunatelly that did not fix the error. There are still<h3>
> >  tags
> > >>>>>>> inside the data. Although I believe there are viewer then before
> > but I
> > >>>>>>> can not prove that. Fact is, there are still html tags inside the
> > >>>>>>> data.
> > >>>>>>>
> > >>>>>>> Any other ideas what the problem could be?



-- 

*Alexei Martchenko* | *CEO* | Superdownloads
alexei@superdownloads.com.br | alexei@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533

Re: strip html from data

Posted by Ahmet Arslan <io...@yahoo.com>.

> Is there a way to strip the html tags completely and not
> index them? If not,
> how do I retrieve the results without html tags?

How do you push documents to solr? You need to strip html tags before the analysis chain. For example, if you are using Data Import Handler, you can use HTMLStripTransformer.

 http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
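
As a rough sketch of what that could look like in a DIH data-config.xml: the transformer is declared on the entity and switched on per field with stripHTML="true". The data source settings, the entity name "item", the SQL query and the column names below are placeholders for illustration, not taken from this thread.

```xml
<dataConfig>
  <!-- Placeholder JDBC settings; replace with your own driver, URL and credentials. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="solr" password="secret"/>
  <document>
    <!-- Hypothetical entity and query. -->
    <entity name="item"
            query="SELECT id, text FROM item"
            transformer="HTMLStripTransformer">
      <field column="id" name="id"/>
      <!-- stripHTML="true" removes the markup before the value reaches the
           schema, so the stored value and the indexed terms are both clean. -->
      <field column="text" name="text" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

Because the transformer runs before the document is handed to the schema, it affects what is stored as well as what is indexed, which an analyzer char filter does not.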

Re: strip html from data

Posted by Merlin Morgenstern <me...@googlemail.com>.

I am sorry, but I do not really understand the difference between the
indexed data and the returned result set.

I look at the "returned" dataset via this command:
solr/select/?q=id:533563&terms=true

which gives me html tags like these: </b><br />

I also tried to turn on TermsComponent, but it did not change anything:
solr/select/?q=id:533563&terms=true

The schema browser does not show any html tags inside the text field, just
the indexed words of that one dataset.

Is there a way to strip the html tags completely and not index them? If not,
how do I retrieve the results without html tags?

Thank you for your help.



2011/8/9 Erick Erickson <er...@gmail.com>

> OK, what does "not working" mean? You never answered Markus' question:
>
> "Are you looking at the returned result set or what you've actually
> indexed?
> Analyzers are not run on the stored data, only on indexed data."
>
> If "not working" means that your returned results contain the markup, then
> you're confusing indexing and storing. All the analysis chains operate
> on data sent into the indexing process. But the verbatim data is *stored*
> prior to (or separate from) indexing.
>
> So my assumption is that you see data returned in the document with
> markup, which is just as it should be, and there's no problem at all. And
> your actual indexed terms (try looking at the data with TermsComponent, or
> admin/schema browser) will NOT have any markup.
>
> Perhaps you can back up a bit and describe what's failing vs. what you
> expect.
>
> Best
> Erick
>

Re: strip html from data

Posted by Erick Erickson <er...@gmail.com>.

OK, what does "not working" mean? You never answered Markus' question:

"Are you looking at the returned result set or what you've actually indexed?
Analyzers are not run on the stored data, only on indexed data."

If "not working" means that your returned results contain the markup, then
you're confusing indexing and storing. All the analysis chains operate
on data sent into the indexing process. But the verbatim data is *stored*
prior to (or separate from) indexing.

So my assumption is that you see data returned in the document with
markup, which is just as it should be, and there's no problem at all. And your
actual indexed terms (try looking at the data with TermsComponent, or
admin/schema browser) will NOT have any markup.
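
To make the distinction concrete, here is a minimal schema.xml sketch. The field names text_raw and text_search are made up for illustration, and it assumes the "text" fieldType has HTMLStripCharFilterFactory in its index-time analyzer.

```xml
<!-- Stored only: returned verbatim in results, markup included. -->
<field name="text_raw" type="string" indexed="false" stored="true"/>

<!-- Indexed only: the searchable terms pass through the analyzer (and so
     through the HTML strip char filter); nothing is returned from this field. -->
<field name="text_search" type="text" indexed="true" stored="false"/>

<!-- copyField hands over the original input, not the analyzed output. -->
<copyField source="text_raw" dest="text_search"/>
```

With a layout like this, a query matches on the stripped terms in text_search while the response still shows the raw HTML from text_raw. Getting markup-free stored values means stripping the HTML before the document reaches Solr.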

Perhaps you can back up a bit and describe what's failing vs. what you
expect.

Best
Erick

On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern
<me...@googlemail.com> wrote:
> Unfortunately I still can't get it running. The code I am using is the
> following:
>                <analyzer type="index">
>                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                    <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                    <filter class="solr.LowerCaseFilterFactory"/>
>                    <filter class="solr.KeywordMarkerFilterFactory"/>
>                    <filter class="solr.PorterStemFilterFactory"/>
>                </analyzer>
>                <analyzer type="query">
>                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                    <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                    <filter class="solr.LowerCaseFilterFactory"/>
>                    <filter class="solr.KeywordMarkerFilterFactory"/>
>                    <filter class="solr.PorterStemFilterFactory"/>
>                </analyzer>
>
> I also tried this one:
>
>    <types>
>         <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>               <analyzer>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                  <tokenizer class="solr.StandardTokenizerFactory"/>
>                  <filter class="solr.StandardFilterFactory"/>
>            </analyzer>
>         </fieldType>
>    </types>
>      <field name="text" type="text" indexed="true" stored="true"
> required="false"/>
>
> None of those worked. I restarted Solr after the schema update and reindexed
> the data. No change, the html tags are still in there.
>
> Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on SUSE
> Linux.
>
> Thank you for any help on this.
>
>
>
> 2011/7/25 Mike Sokolov <so...@ifactory.com>
>
>> Hmm that looks like it's working fine.  I stand corrected.
>>
>>
>>
>> On 07/25/2011 12:24 PM, Markus Jelsma wrote:
>>
>>> I've seen that issue too and read comments on the list, yet I've never had
>>> trouble with the order; I don't know what's going on. Check this analyzer,
>>> I've moved the charFilter to the bottom:
>>>
>>> <analyzer type="index">
>>>   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>           generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>           catenateAll="0" splitOnCaseChange="1"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>           ignoreCase="false" expand="true"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="false"
>>>           words="stopwords.txt"/>
>>>   <filter class="solr.ASCIIFoldingFilterFactory"/>
>>>   <filter class="solr.SnowballPorterFilterFactory"
>>>           protected="protwords.txt" language="Dutch"/>
>>>   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>   <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>> </analyzer>
>>>
>>> The analysis chain still does its job as I expect for the input:
>>> <span>bla bla</span>
>>>
>>> Index Analyzer
>>>
>>> org.apache.solr.analysis.HTMLStripCharFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> text    bla bla
>>>
>>> org.apache.solr.analysis.WhitespaceTokenizerFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>> {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1,
>>> luceneMatchVersion=LUCENE_34, generateWordParts=1, catenateAll=0,
>>> catenateNumbers=1}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>> type    word    word
>>>
>>> org.apache.solr.analysis.LowerCaseFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> startOffset     6       10
>>> endOffset       9       13
>>> type    word    word
>>>
>>> org.apache.solr.analysis.SynonymFilterFactory
>>> {synonyms=synonyms.txt, expand=true, ignoreCase=false,
>>> luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>> org.apache.solr.analysis.StopFilterFactory
>>> {words=stopwords.txt, ignoreCase=false, luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>> org.apache.solr.analysis.ASCIIFoldingFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>> {protected=protwords.txt, language=Dutch, luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> keyword         false   false
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>> {luceneMatchVersion=LUCENE_34}
>>> position        1       2
>>> term text       bla     bla
>>> keyword         false   false
>>> type    word    word
>>> startOffset     6       10
>>> endOffset       9       13
>>>
>>>
>>> On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
>>>
>>>
>>>> Hmm - I'm not sure about that; see
>>>> https://issues.apache.org/jira/browse/SOLR-2119
>>>>
>>>> On 07/25/2011 12:01 PM, Markus Jelsma wrote:
>>>>
>>>>
>>>>> charFilters are executed first regardless of their position in the
>>>>> analyzer.
>>>>>
>>>>> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
>>>>>
>>>>>
>>>>>> I think you need to list the charFilter earlier in the analysis chain,
>>>>>> before the tokenizer.  Probably Solr should tell you this...
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
>>>>>>
>>>>>>
>>>>>>> Sounds logical. I just changed it to the following, restarted and
>>>>>>> reindexed with commit:
>>>>>>>
>>>>>>>            <fieldType name="text" class="solr.TextField"
>>>>>>>                       positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>>>                   <analyzer type="index">
>>>>>>>                       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>                       <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>                               generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>>>>                               catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>                       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>                   </analyzer>
>>>>>>>                   <analyzer type="query">
>>>>>>>                       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>                       <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>                               generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>>>>                               catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>>>>>                       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>>>                   </analyzer>
>>>>>>>            </fieldType>
>>>>>>>
>>>>>>> Unfortunately that did not fix the error. There are still <h3> tags
>>>>>>> inside the data, although I believe there are fewer than before, but I
>>>>>>> cannot prove that. Fact is, there are still html tags inside the data.
>>>>>>>
>>>>>>> Any other ideas what the problem could be?
>>>>>>>
>>>>>>> 2011/7/25 Markus Jelsma <ma...@openindex.io>
>>>>>>>
>>>>>>>> You've three analyzer elements, I wonder what that would do. You need
>>>>>>>> to add the char filter to the index-time analyzer.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Markus Jelsma - CTO - Openindex
>>>>>>>> http://www.linkedin.com/in/markus17
>>>>>>>> 050-8536620 / 06-50258350
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>>
>

Re: strip html from data

Posted by Merlin Morgenstern <me...@googlemail.com>.

Unfortunately I still can't get it running. The code I am using is the
following:
                <analyzer type="index">
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="1"
                            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
                            generateWordParts="1" generateNumberParts="1" catenateWords="0"
                            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                </analyzer>

I also tried this one:

    <types>
        <fieldType name="text" class="solr.TextField"
                   positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer>
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
            </analyzer>
        </fieldType>
    </types>
    <field name="text" type="text" indexed="true" stored="true"
           required="false"/>

None of those worked. I restarted Solr after the schema update and reindexed
the data. No change: the HTML tags are still in there.

Any other ideas? Maybe this is a bug in Solr? I am using Solr 3.3.0 on SUSE
Linux.

Thank you for any help on this.
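
A possible explanation, offered as a hedged note rather than a confirmed
diagnosis: analyzers (charFilters included) only shape the terms that go into
the index. With stored="true", the value Solr returns in search results is the
original input, so tags would stay visible there however the analyzer is
configured. If the stored text itself has to be clean, one option is to strip
the markup on the client before the document is sent to Solr. The sketch below
does that with Python's standard html.parser module. The strip_html helper,
the "id" value and the sample markup are illustrative assumptions, not
something taken from this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text content of the markup, ignoring tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(" ".join(parser.parts).split())


# Hypothetical document; "text" is the field name used in the schema above.
raw = "<h3>Headline</h3><p>Some <b>bold</b> text &amp; more</p>"
doc = {"id": "1", "text": strip_html(raw)}
print(doc["text"])
```

Joining the pieces with a space is a deliberate simplification: it keeps
"Headline" and "Some" apart, at the cost of also splitting words that contain
inline tags.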



2011/7/25 Mike Sokolov <so...@ifactory.com>

> Hmm that looks like it's working fine.  I stand corrected.

Re: strip html from data

Posted by Mike Sokolov <so...@ifactory.com>.

Hmm that looks like it's working fine.  I stand corrected.


On 07/25/2011 12:24 PM, Markus Jelsma wrote:
> I've seen that issue too and read comments on the list yet i've never had
> trouble with the order, don't know what's going on. Check this analyzer, i've
> moved the charFilter to the bottom:
>
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
> splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="false" expand="true"/>
> <filter class="solr.StopFilterFactory" ignoreCase="false"
> words="stopwords.txt"/>
> <filter class="solr.ASCIIFoldingFilterFactory"/>
> <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"
> language="Dutch"/>
> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> <charFilter class="solr.HTMLStripCharFilterFactory"/>
> </analyzer>
>
> The analysis chain still does its job as i expect for the input:
> <span>bla bla</span>
>
> Index Analyzer
> org.apache.solr.analysis.HTMLStripCharFilterFactory
> {luceneMatchVersion=LUCENE_34}
> text 	bla bla
> org.apache.solr.analysis.WhitespaceTokenizerFactory
> {luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> startOffset 	6	10
> endOffset 	9	13
> org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
> generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34,
> generateWordParts=1, catenateAll=0, catenateNumbers=1}
> position 	1	2
> term text 	bla	bla
> startOffset 	6	10
> endOffset 	9	13
> type 	word	word
> org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> startOffset 	6	10
> endOffset 	9	13
> type 	word	word
> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
> expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> type 	word	word
> startOffset 	6	10
> endOffset 	9	13
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
> ignoreCase=false, luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> type 	word	word
> startOffset 	6	10
> endOffset 	9	13
> org.apache.solr.analysis.ASCIIFoldingFilterFactory
> {luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> type 	word	word
> startOffset 	6	10
> endOffset 	9	13
> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt,
> language=Dutch, luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> keyword 	false	false
> type 	word	word
> startOffset 	6	10
> endOffset 	9	13
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> {luceneMatchVersion=LUCENE_34}
> position 	1	2
> term text 	bla	bla
> keyword 	false	false
> type 	word	word
> startOffset 	6	10
> endOffset 	9	13

Re: strip html from data

Posted by Markus Jelsma <ma...@openindex.io>.

I've seen that issue too and read the comments on the list, yet I've never had
trouble with the order; I don't know what's going on. Check this analyzer, where
I've moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false"
          words="stopwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt"
          language="Dutch"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>

The analysis chain still does its job as I expect for the input:
<span>bla bla</span>

Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory 
{luceneMatchVersion=LUCENE_34}
text 	bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory 
{luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
startOffset 	6	10
endOffset 	9	13
org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, 
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_34, 
generateWordParts=1, catenateAll=0, catenateNumbers=1}
position 	1	2
term text 	bla	bla
startOffset 	6	10
endOffset 	9	13
type 	word	word
org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
startOffset 	6	10
endOffset 	9	13
type 	word	word
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, 
expand=true, ignoreCase=false, luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
type 	word	word
startOffset 	6	10
endOffset 	9	13
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=false, luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
type 	word	word
startOffset 	6	10
endOffset 	9	13
org.apache.solr.analysis.ASCIIFoldingFilterFactory 
{luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
type 	word	word
startOffset 	6	10
endOffset 	9	13
org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, 
language=Dutch, luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
keyword 	false	false
type 	word	word
startOffset 	6	10
endOffset 	9	13
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory 
{luceneMatchVersion=LUCENE_34}
position 	1	2
term text 	bla	bla
keyword 	false	false
type 	word	word
startOffset 	6	10
endOffset 	9	13
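
The ordering behaviour seen in this output can be pictured with a small
sketch. It is a toy model in Python, not Solr's implementation: the function
names, the regex-based tag stripping and the list of (kind, function) pairs
are assumptions made for illustration. The components are declared with the
char filter last, yet the char filter still runs before the tokenizer, and
token offsets are mapped back to the original input.

```python
import re

TAG = re.compile(r"<[^>]+>")


def html_strip(text):
    """Toy char filter: drop tags, remember each kept character's source index."""
    kept, origin = [], []
    pos = 0
    for match in TAG.finditer(text):
        for i in range(pos, match.start()):
            kept.append(text[i])
            origin.append(i)
        pos = match.end()
    for i in range(pos, len(text)):
        kept.append(text[i])
        origin.append(i)
    return "".join(kept), origin


def whitespace_tokenize(text):
    """Toy tokenizer: (term, start, end) for each run of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def lowercase(tokens):
    """Toy token filter."""
    return [(term.lower(), start, end) for term, start, end in tokens]


def analyze(declared, text):
    """Run components grouped by kind, ignoring the order they were declared in."""
    char_filters = [fn for kind, fn in declared if kind == "charFilter"]
    tokenizer = next(fn for kind, fn in declared if kind == "tokenizer")
    token_filters = [fn for kind, fn in declared if kind == "filter"]

    origin = list(range(len(text)))
    for char_filter in char_filters:
        text, step = char_filter(text)
        origin = [origin[i] for i in step]

    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)

    # Translate offsets in the filtered text back to the original input.
    return [(term, origin[start], origin[end - 1] + 1) for term, start, end in tokens]


# Declared in the same spirit as the analyzer above: charFilter last.
declared = [
    ("tokenizer", whitespace_tokenize),
    ("filter", lowercase),
    ("charFilter", html_strip),
]
tokens = analyze(declared, "<span>bla bla</span>")
print(tokens)
```

For the input <span>bla bla</span> the sketch yields the terms bla/bla with
offsets 6-9 and 10-13, the same start and end offsets as in the analysis
output above.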


On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
> Hmm - I'm not sure about that; see
> https://issues.apache.org/jira/browse/SOLR-2119

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: strip html from data

Posted by Mike Sokolov <so...@ifactory.com>.

Hmm - I'm not sure about that; see 
https://issues.apache.org/jira/browse/SOLR-2119

On 07/25/2011 12:01 PM, Markus Jelsma wrote:
> charFilters are executed first regardless of their position in the analyzer.
>
> On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
>    
>> I think you need to list the charFilter earlier in the analysis chain,
>> before the tokenizer.  Solr should probably tell you this...
>>
>> -Mike
>>
>> On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
>>      
>>> sounds logical. I just changed it to the following, restarted and
>>> reindexed with commit:
>>>
>>>            <fieldType name="text" class="solr.TextField"
>>>                       positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>                   <analyzer type="index">
>>>                       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>                       <filter class="solr.WordDelimiterFilterFactory"
>>>                               generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>                               catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>                       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>                   </analyzer>
>>>                   <analyzer type="query">
>>>                       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>                       <filter class="solr.WordDelimiterFilterFactory"
>>>                               generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>                               catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>                       <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>                   </analyzer>
>>>            </fieldType>
>>>
>>> Unfortunatelly that did not fix the error. There are still<h3>   tags
>>> inside the data. Although I believe there are viewer then before but I
>>> can not prove that. Fact is, there are still html tags inside the data.
>>>
>>> Any other ideas what the problem could be?
>>>
>>>
>>>
>>>
>>>
>>> 2011/7/25 Markus Jelsma<ma...@openindex.io>
>>>
>>>        
>>>> You've three analyzer elements, i wonder what that would do. You need to
>>>> add
>>>> the char filter to the index-time analyzer.
>>>>
>>>> On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
>>>>          
>>>>> Hi there,
>>>>>
>>>>> I am trying to strip html tags from the data before adding the
>>>>> documents
>>>>>            
>>>> to
>>>>
>>>>          
>>>>> the index. To do that I altered schem.xml like this:
>>>>>            <fieldType name="text" class="solr.TextField"
>>>>>
>>>>> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>>>>
>>>>>                   <analyzer type="index">
>>>>>
>>>>>                       <tokenizer
>>>>>                       class="solr.WhitespaceTokenizerFactory"/>  <filter
>>>>>                       class="solr.WordDelimiterFilterFactory"
>>>>>
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>>>
>>>>>                   </analyzer>
>>>>>                   <analyzer type="query">
>>>>>
>>>>>                       <tokenizer
>>>>>                       class="solr.WhitespaceTokenizerFactory"/>  <filter
>>>>>                       class="solr.WordDelimiterFilterFactory"
>>>>>
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>>> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>                       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>                       <filter class="solr.KeywordMarkerFilterFactory"/>
>>>>>                       <filter class="solr.PorterStemFilterFactory"/>
>>>>>
>>>>>                   </analyzer>
>>>>>                   <analyzer>
>>>>>
>>>>>                       <charFilter
>>>>>                       class="solr.HTMLStripCharFilterFactory"/>
>>>>>
>>>>>                        <tokenizer
>>>>>                        class="solr.WhitespaceTokenizerFactory"/>
>>>>>
>>>>>                   </analyzer>
>>>>>
>>>>>            </fieldType>
>>>>>
>>>>>       <fields>
>>>>>
>>>>>           <field name="text" type="text" indexed="true" stored="true"
>>>>>
>>>>> required="false"/>
>>>>>
>>>>>       </fields>
>>>>>
>>>>> Unfortunatelly this does not work, the hmtl tags like<h3>   are still
>>>>> present after restarting and reindexing. I also tryed
>>>>> htmlstriptransformer, but this did not work either.
>>>>>
>>>>> Has anybody an idea how to get this done? Thank you in advance for any
>>>>> hint.
>>>>>
>>>>> Merlin
>>>>>            
>>>> --
>>>> Markus Jelsma - CTO - Openindex
>>>> http://www.linkedin.com/in/markus17
>>>> 050-8536620 / 06-50258350
>>>>          
>

Re: strip html from data

Posted by Markus Jelsma <ma...@openindex.io>.

charFilters are executed first regardless of their position in the analyzer.

On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
> I think you need to list the charfilter earlier in the analysis chain,
> before the tokenizer.  Probably Solr should tell you this...
> 
> -Mike

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: strip html from data

Posted by Mike Sokolov <so...@ifactory.com>.

I think you need to list the charfilter earlier in the analysis chain,
before the tokenizer.  Probably Solr should tell you this...
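
For reference, a minimal sketch of the conventional ordering inside a single
analyzer, using only factories already mentioned in this thread: char filters
first, then exactly one tokenizer, then token filters. Whether Solr warns
about or tolerates a different order may depend on the version, so treat this
as the safe layout rather than a statement about what the parser requires.

```xml
<!-- Sketch only: conventional element order within one analyzer -->
<analyzer type="index">
    <!-- 1. char filters operate on the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. a single tokenizer splits that stream into tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 3. token filters then modify the tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```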

-Mike

On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
> sounds logical. I just changed it to the following, restarted and reindexed
> with commit:
>
>           <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>                  <analyzer type="index">
>                      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                      <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                      <filter class="solr.LowerCaseFilterFactory"/>
>                      <filter class="solr.KeywordMarkerFilterFactory"/>
>                      <filter class="solr.PorterStemFilterFactory"/>
>                      <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                  </analyzer>
>                  <analyzer type="query">
>                      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                      <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                      <filter class="solr.LowerCaseFilterFactory"/>
>                      <filter class="solr.KeywordMarkerFilterFactory"/>
>                      <filter class="solr.PorterStemFilterFactory"/>
>                      <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                  </analyzer>
>           </fieldType>
>
> Unfortunately that did not fix the error. There are still <h3> tags inside
> the data. I believe there are fewer than before, but I cannot
> prove that. The fact is, there are still HTML tags inside the data.
>
> Any other ideas what the problem could be?

Re: strip html from data

Posted by Markus Jelsma <ma...@openindex.io>.

Are you looking at the returned result set or what you've actually indexed? 
Analyzers are not run on the stored data, only on indexed data.
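
To illustrate the distinction: with stored="true" the response returns the
original input verbatim, so markup is still visible there even when the
indexed tokens are clean. If the stored value itself has to be free of markup,
the HTML must be removed before the document reaches the field. Below is a
hedged sketch using the DataImportHandler's HTMLStripTransformer, assuming the
data is loaded through DIH. The entity name, query and column names are
hypothetical placeholders, and the dataSource definition is omitted.

```xml
<!-- data-config.xml sketch; entity name, query and columns are placeholders -->
<dataConfig>
    <!-- dataSource element omitted; it depends on the local setup -->
    <document>
        <entity name="article"
                transformer="HTMLStripTransformer"
                query="SELECT id, text FROM articles">
            <field column="id" name="id"/>
            <!-- the transformer only acts on fields marked stripHTML="true" -->
            <field column="text" name="text" stripHTML="true"/>
        </entity>
    </document>
</dataConfig>
```

With a setup along these lines the value is stripped before it is stored, so
both the stored text and the indexed tokens would come from the cleaned input.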

On Monday 25 July 2011 15:03:18 Merlin Morgenstern wrote:
> sounds logical. I just changed it to the following, restarted and reindexed
> with commit:
> 
>          <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>                 <analyzer type="index">
>                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                     <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                     <filter class="solr.LowerCaseFilterFactory"/>
>                     <filter class="solr.KeywordMarkerFilterFactory"/>
>                     <filter class="solr.PorterStemFilterFactory"/>
>                     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                 </analyzer>
>                 <analyzer type="query">
>                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                     <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                     <filter class="solr.LowerCaseFilterFactory"/>
>                     <filter class="solr.KeywordMarkerFilterFactory"/>
>                     <filter class="solr.PorterStemFilterFactory"/>
>                     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                 </analyzer>
>          </fieldType>
> 
> Unfortunately that did not fix the error. There are still <h3> tags inside
> the data. I believe there are fewer than before, but I cannot
> prove that. The fact is, there are still HTML tags inside the data.
> 
> Any other ideas what the problem could be?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: strip html from data

Posted by Merlin Morgenstern <me...@googlemail.com>.

sounds logical. I just changed it to the following, restarted and reindexed
with commit:

         <fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
                <analyzer type="index">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
                <analyzer type="query">
                    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                    <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
                    <filter class="solr.LowerCaseFilterFactory"/>
                    <filter class="solr.KeywordMarkerFilterFactory"/>
                    <filter class="solr.PorterStemFilterFactory"/>
                    <charFilter class="solr.HTMLStripCharFilterFactory"/>
                </analyzer>
         </fieldType>

Unfortunately that did not fix the error. There are still <h3> tags inside
the data. I believe there are fewer than before, but I cannot
prove that. The fact is, there are still HTML tags inside the data.

Any other ideas what the problem could be?





2011/7/25 Markus Jelsma <ma...@openindex.io>

> You have three analyzer elements; I wonder what that would do. You need to
> add the char filter to the index-time analyzer.

Re: strip html from data

Posted by Markus Jelsma <ma...@openindex.io>.

You have three analyzer elements; I wonder what that would do. You need to add
the char filter to the index-time analyzer.
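
A sketch of what that could look like, reusing the filters from the quoted
schema below and dropping the third, untyped analyzer. This is an untested
rearrangement of the posted configuration, not a verified fix. Leaving the
char filter out of the query analyzer is an assumption, based on queries not
normally containing markup.

```xml
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer type="index">
        <!-- strip markup from the character stream before tokenizing -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <!-- query side kept as posted -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>
```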

On Monday 25 July 2011 13:09:14 Merlin Morgenstern wrote:
> Hi there,
> 
> I am trying to strip html tags from the data before adding the documents to
> the index. To do that I altered schema.xml like this:
> 
>          <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>                 <analyzer type="index">
>                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                     <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>                     <filter class="solr.LowerCaseFilterFactory"/>
>                     <filter class="solr.KeywordMarkerFilterFactory"/>
>                     <filter class="solr.PorterStemFilterFactory"/>
>                 </analyzer>
>                 <analyzer type="query">
>                     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                     <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                     <filter class="solr.LowerCaseFilterFactory"/>
>                     <filter class="solr.KeywordMarkerFilterFactory"/>
>                     <filter class="solr.PorterStemFilterFactory"/>
>                 </analyzer>
>                 <analyzer>
>                     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                 </analyzer>
>          </fieldType>
> 
>     <fields>
>         <field name="text" type="text" indexed="true" stored="true"
> required="false"/>
>     </fields>
> 
> Unfortunately this does not work; the HTML tags like <h3> are still
> present after restarting and reindexing. I also tried
> htmlstriptransformer, but this did not work either.
> 
> Does anybody have an idea how to get this done? Thank you in advance for any
> hint.
> 
> Merlin

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350