Posted to solr-user@lucene.apache.org by Lee Carroll <le...@googlemail.com> on 2015/10/20 16:21:11 UTC
char filter factory and tokeniser issue in admin Analysis form
Hi,
On Solr 4.7 I've run into a strange issue. Whilst setting up a field I've
noticed in the Analysis form that when I use a char filter factory (for example
HTMLSCF) with a tokeniser (ST) the analysis chain grinds to a halt. The
char filter does not seem to pass anything into the tokeniser.
Field type is:
<fieldType name="clean_text" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English"/>
  </analyzer>
</fieldType>
Output of the analysis screen is:
Field value (index)
Content with mark up <br /> should be cleaned
HTMLSCF > Content with mark up should be cleaned
ST > <BLANK>
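A possible cross-check, sketched under assumptions: the same analysis can be requested from the field analysis request handler directly, which may help separate a display problem in the form from a real break in the chain. The host, port, core name (collection1) and the /analysis/field path below are assumptions about a default-style setup, not details from this post, and the handler has to be registered in solrconfig.xml for the request to work.

```python
import urllib.parse
import urllib.request


def build_analysis_url(base_url, core, field_type, value):
    """Build a field analysis request URL for one field type and one value."""
    params = {
        "analysis.fieldtype": field_type,  # e.g. the clean_text type above
        "analysis.fieldvalue": value,      # index-time text to analyse
        "wt": "json",
    }
    # Assumed handler path; adjust if solrconfig.xml registers it elsewhere.
    path = "/".join([base_url.rstrip("/"), core, "analysis/field"])
    return path + "?" + urllib.parse.urlencode(params)


def fetch_analysis(url, timeout=10):
    """Return the raw response body as text (needs a running Solr)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical host and core name, used only for illustration.
    url = build_analysis_url(
        "http://localhost:8983/solr",
        "collection1",
        "clean_text",
        "Content with mark up <br /> should be cleaned",
    )
    print(url)
    # With Solr running, inspect the per-stage output:
    # print(fetch_analysis(url))
```

If the raw response shows tokens for the tokeniser stage, that would suggest the chain itself is fine and only the form's rendering is at fault.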
I know I must be missing something obvious!
Cheers Lee C
...
Re: char filter factory and tokeniser issue in admin Analysis form
Posted by Lee Carroll <le...@googlemail.com>.
No Alexandre, it's just Sod's law (http://www.thefreedictionary.com/Sod's+Law)
:-)
Lee C
Re: char filter factory and tokeniser issue in admin Analysis form
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
On 20 October 2015 at 10:26, Lee Carroll <le...@googlemail.com> wrote:
> B*ll*cks, before posting I spent an hour searching for issues, honest.
> Soon as I post within seconds I find
>
> https://issues.apache.org/jira/browse/SOLR-5800
We are always glad to be of help. Including by RubberDucking:
http://c2.com/cgi/wiki?RubberDucking
Now remember the question that you asked yourself for that insight and
remember to ask it next time. I suspect it was "4.7? I wonder if it is
a version-specific issue, since solved". I classify this under
"Magnitude" in my presentation at Solr Revolution this past week:
http://www.slideshare.net/arafalov/solr-troubleshooting-treemap-approach
(slide 10).
Regards,
Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/
Re: char filter factory and tokeniser issue in admin Analysis form
Posted by Lee Carroll <le...@googlemail.com>.
B*ll*cks, before posting I spent an hour searching for issues, honest.
Soon as I post within seconds I find
https://issues.apache.org/jira/browse/SOLR-5800