Posted to dev@lucene.apache.org by "Alexandre Rafalovitch (JIRA)" <ji...@apache.org> on 2018/06/30 03:31:00 UTC
[jira] [Commented] (SOLR-2834) AnalysisResponseBase.java doesn't
handle org.apache.solr.analysis.HTMLStripCharFilter
[ https://issues.apache.org/jira/browse/SOLR-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528526#comment-16528526 ]
Alexandre Rafalovitch commented on SOLR-2834:
---------------------------------------------
I can still see it with 7.4. The issue is that a CharFilter (any, I think) returns an unexpected sequence that does not have all the tokenInfo. For example, here is the start of the output when running an analysis against the _text_fa_ type:
{noformat}
"analysis":{
  "field_types":{
    "text_fa":{
      "index":[
        "org.apache.lucene.analysis.fa.PersianCharFilter","this is a test",
        "org.apache.lucene.analysis.standard.StandardTokenizer",[{
            "text":"this",
            "raw_bytes":"[74 68 69 73]",
            "start":0,
            "end":4,
            "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
            "type":"<ALPHANUM>",
            "org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute#termFrequency":1,
            "position":1,
            "positionHistory":[1]},
{noformat}
The output of the _PersianCharFilter_ is therefore a plain string rather than a list of tokens. In the Admin interface, it shows as a character-by-character display. In the SolrJ code, it just dies with an exception because of:
{code:java}
// List<NamedList<Object>> tokens = phaseEntry.getValue();
{code}
The question is whether this should be fixed on the server side, to emit a full tokenInfo for the parsed string, or on both client sides (and perhaps in every other client), to deal with this exceptional case.
There does not seem to be a test for CharFilter either.
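For the client-side option, a minimal sketch of the instanceof check could look like the following. It is not the SolrJ implementation: it assumes a plain Map<String, Object> as a stand-in for the NamedList of phases, and PhaseParsingSketch, Phase and filteredText are hypothetical names made up for illustration. The only point is that a phase value that is a String (CharFilter output) gets handled separately from one that is a List of token entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PhaseParsingSketch {

    /** Hypothetical stand-in for one analysis phase (not a SolrJ class). */
    static final class Phase {
        final String className;
        final String filteredText;              // set only when the phase value was not a list
        final List<Map<String, Object>> tokens; // empty when the phase value was not a list

        Phase(String className, String filteredText, List<Map<String, Object>> tokens) {
            this.className = className;
            this.filteredText = filteredText;
            this.tokens = tokens;
        }
    }

    /**
     * Builds phases from a map of phase class name to phase value.
     * Assumption: the value is either a List of token maps or a bare
     * String, as in the response excerpts quoted in this issue.
     */
    @SuppressWarnings("unchecked")
    static List<Phase> buildPhases(Map<String, Object> phaseEntries) {
        List<Phase> phases = new ArrayList<>();
        for (Map.Entry<String, Object> entry : phaseEntries.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof List) {
                // Tokenizer and TokenFilter phases: a list of token info entries.
                phases.add(new Phase(entry.getKey(), null, (List<Map<String, Object>>) value));
            } else {
                // CharFilter phases: keep the text instead of casting it to a List.
                phases.add(new Phase(entry.getKey(), String.valueOf(value), new ArrayList<>()));
            }
        }
        return phases;
    }

    public static void main(String[] args) {
        Map<String, Object> token = new LinkedHashMap<>();
        token.put("text", "this");

        Map<String, Object> index = new LinkedHashMap<>();
        index.put("org.apache.lucene.analysis.fa.PersianCharFilter", "this is a test");
        index.put("org.apache.lucene.analysis.standard.StandardTokenizer", List.of(token));

        for (Phase phase : buildPhases(index)) {
            System.out.println(phase.className
                    + " text=" + phase.filteredText
                    + " tokens=" + phase.tokens.size());
        }
    }
}
```

Whether a client should keep the filtered text, as here, or simply skip such phases is a separate design decision; the sketch only shows that the cast can be guarded.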
> AnalysisResponseBase.java doesn't handle org.apache.solr.analysis.HTMLStripCharFilter
> -------------------------------------------------------------------------------------
>
> Key: SOLR-2834
> URL: https://issues.apache.org/jira/browse/SOLR-2834
> Project: Solr
> Issue Type: Bug
> Components: clients - java, Schema and Analysis
> Affects Versions: 3.4, 3.6, 4.2, 7.4
> Reporter: Shane
> Assignee: Shalin Shekhar Mangar
> Priority: Minor
> Labels: patch
> Attachments: AnalysisResponseBase.patch, SOLR-2834.patch
>
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> When using FieldAnalysisRequest.java to analyze a field, a ClassCastException is thrown if the schema defines the filter org.apache.solr.analysis.HTMLStripCharFilter. The exception is:
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.List
> at org.apache.solr.client.solrj.response.AnalysisResponseBase.buildPhases(AnalysisResponseBase.java:69)
> at org.apache.solr.client.solrj.response.FieldAnalysisResponse.setResponse(FieldAnalysisResponse.java:66)
> at org.apache.solr.client.solrj.request.FieldAnalysisRequest.process(FieldAnalysisRequest.java:107)
> My schema definition is:
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
> Part of the response is:
> <lst name="query">
>   <str name="org.apache.solr.analysis.HTMLStripCharFilter">testing analysis</str>
>   <arr name="org.apache.lucene.analysis.standard.StandardTokenizer">
>     <lst>...
> A simplistic fix would be to test if the Entry value is an instance of List.