Posted to dev@lucene.apache.org by "Alexandre Rafalovitch (JIRA)" <ji...@apache.org> on 2018/06/30 03:31:00 UTC
[jira] [Commented] (SOLR-2834) AnalysisResponseBase.java doesn't handle org.apache.solr.analysis.HTMLStripCharFilter

    [ https://issues.apache.org/jira/browse/SOLR-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528526#comment-16528526 ] 

Alexandre Rafalovitch commented on SOLR-2834:
---------------------------------------------

I can still reproduce this with 7.4. The issue is that a CharFilter (any of them, I think) returns an unexpected sequence that does not carry the full tokenInfo. For example, here is the start of the output when running an analysis against the _text_fa_ type:
{noformat}
  "analysis":{
    "field_types":{
      "text_fa":{
        "index":[
          "org.apache.lucene.analysis.fa.PersianCharFilter","this is a test",
          "org.apache.lucene.analysis.standard.StandardTokenizer",[{
              "text":"this",
              "raw_bytes":"[74 68 69 73]",
              "start":0,
              "end":4,
              "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
              "type":"<ALPHANUM>",
              "org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency":1,
              "position":1,
              "positionHistory":[1]},
{noformat}
The output of the _PersianCharFilter_ is then a plain string rather than a list of tokens. In the Admin interface, it shows up as a character-by-character display. In the SolrJ code, it just dies with an exception because of:
{code:java}
// List<NamedList<Object>> tokens = phaseEntry.getValue();
{code}
The question is whether this should be fixed on the server side, to emit a full tokenInfo for the parsed string, or on both client sides (and perhaps in every other client), to deal with this exceptional case.
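
To make the client-side option concrete, here is a minimal sketch of the instanceof check. It is not the SolrJ code and not a proposed patch: plain Maps stand in for NamedList, and the class and method names (PhaseValueSketch, tokensOf, charFilterTextOf) are hypothetical, chosen only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain Maps stand in for SolrJ's NamedList,
// and every name in this class is hypothetical.
final class PhaseValueSketch {

    // Returns the token list for a tokenizer/filter phase, or an empty list
    // when the phase value is not a List (e.g. a char filter's plain String).
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> tokensOf(Object phaseValue) {
        if (phaseValue instanceof List) {
            return (List<Map<String, Object>>) phaseValue;
        }
        return Collections.emptyList();
    }

    // Returns the filtered text for a char filter phase, or null otherwise.
    static String charFilterTextOf(Object phaseValue) {
        return (phaseValue instanceof String) ? (String) phaseValue : null;
    }

    public static void main(String[] args) {
        // One token, shaped loosely like the StandardTokenizer output above.
        Map<String, Object> token = new LinkedHashMap<>();
        token.put("text", "this");
        token.put("start", 0);
        token.put("end", 4);

        // Phases in response order: a String value first, then a List value.
        Map<String, Object> phases = new LinkedHashMap<>();
        phases.put("org.apache.lucene.analysis.fa.PersianCharFilter", "this is a test");
        phases.put("org.apache.lucene.analysis.standard.StandardTokenizer", List.of(token));

        for (Map.Entry<String, Object> phaseEntry : phases.entrySet()) {
            String text = charFilterTextOf(phaseEntry.getValue());
            int tokenCount = tokensOf(phaseEntry.getValue()).size();
            System.out.println(phaseEntry.getKey() + ": text=" + text + ", tokens=" + tokenCount);
        }
    }
}
```

A check like this avoids the ClassCastException, but each client then has to decide how to represent the char filter text, which is the argument for fixing it on the server side instead.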

There does not seem to be a test for CharFilter either.

> AnalysisResponseBase.java doesn't handle org.apache.solr.analysis.HTMLStripCharFilter
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-2834
>                 URL: https://issues.apache.org/jira/browse/SOLR-2834
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java, Schema and Analysis
>    Affects Versions: 3.4, 3.6, 4.2, 7.4
>            Reporter: Shane
>            Assignee: Shalin Shekhar Mangar
>            Priority: Minor
>              Labels: patch
>         Attachments: AnalysisResponseBase.patch, SOLR-2834.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using FieldAnalysisRequest.java to analyze a field, a ClassCastException is thrown if the schema defines the filter org.apache.solr.analysis.HTMLStripCharFilter.  The exception is:
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.List
>        at org.apache.solr.client.solrj.response.AnalysisResponseBase.buildPhases(AnalysisResponseBase.java:69)
>        at org.apache.solr.client.solrj.response.FieldAnalysisResponse.setResponse(FieldAnalysisResponse.java:66)
>        at org.apache.solr.client.solrj.request.FieldAnalysisRequest.process(FieldAnalysisRequest.java:107)
> My schema definition is:
>     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <charFilter class="solr.HTMLStripCharFilterFactory" />
>         <tokenizer class="solr.StandardTokenizerFactory" />
>         <filter class="solr.StandardFilterFactory" />
>         <filter class="solr.TrimFilterFactory" />
>         <filter class="solr.LowerCaseFilterFactory" />
>       </analyzer>
>     </fieldType>
> The response, in part, is:
>         <lst name="query">
>           <str name="org.apache.solr.analysis.HTMLStripCharFilter">testing analysis</str>
>           <arr name="org.apache.lucene.analysis.standard.StandardTokenizer">
>             <lst>...
> A simplistic fix would be to test if the Entry value is an instance of List.


