Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2013/04/19 14:41:16 UTC

[jira] [Commented] (CONNECTORS-675) MCF-ES fails to escape json correctly

    [ https://issues.apache.org/jira/browse/CONNECTORS-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636323#comment-13636323 ] 

Karl Wright commented on CONNECTORS-675:
----------------------------------------

r1469806.

Please sync up with trunk and try this fix. It should work, but let me know if you run into any problems.

> MCF-ES fails to escape json correctly
> -------------------------------------
>
>                 Key: CONNECTORS-675
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-675
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector
>    Affects Versions: ManifoldCF 1.2
>         Environment: MCF 1.2-SNAPSHOT running on Win2008R2.
> java version "1.7.0_15"
> Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
> ----------------
> elasticsearch 0.90.0rc2 on ubuntu 12.10
> java version "1.7.0_15"
> Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
> -----------------
> Repository Connection: FileSystem
> Output Connection: ElasticSearch
>            Reporter: konrad
>            Assignee: Karl Wright
>             Fix For: ManifoldCF 1.2
>
>
> When crawling a filesystem into Elasticsearch, the generated JSON contains invalid UTF-8 sequences, which causes Elasticsearch to fail the index operation.
> Stack trace from Elasticsearch:
> {noformat}
> [2013-04-19 13:17:38,952][DEBUG][action.index] [Lighting Rod] [eses2][0], node[Ycj8DEZMQFuX7Gn2sSCUXw],
> [P], s[STARTED]: Failed to execute [index 
> {[eses][attachment][file:/C:/indexdir/Lüneburg/somefile],
> source[{"uri" : "C:\\indexdir\\L�neburg\\somefile", 
> "allow_token_document" :
> "__nosecurity__","deny_token_document" : "__nosecurity__","allow_token_share" : "__nosecurity__","deny_token_share" :
> "__nosecurity__","type" : "attachment","_name" : "collection.pickle","file" : "KGRwMQp.....
> org.elasticsearch.index.mapper.MapperParsingException: failed to parse [uri]
> at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:395)
> at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:599)
> at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:467)
> at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:506)
> at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:450)
> at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:326)
> at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:203)
> at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:532)
> at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:430)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: org.elasticsearch.common.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xfc
> at [Source: [B@56c77e95; line: 1, column: 254]
> at org.elasticsearch.common.jackson.core.JsonParser._constructError(JsonParser.java:1378)
> at org.elasticsearch.common.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
> at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3008)
> at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3002)
> at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2165)
> at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2092)
> at org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:275)
> at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:85)
> at org.elasticsearch.common.xcontent.support.AbstractXContentParser.textOrNull(AbstractXContentParser.java:107)
> at org.elasticsearch.index.mapper.core.StringFieldMapper.parseCreateField(StringFieldMapper.java:286)
> at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:384)
> ... 11 more
> {noformat}
> In this case it is a German umlaut 'ü', but since ElasticSearchIndex#jsonStringEscape() does little more than escape backslashes, I assume this affects a much wider range of non-ASCII characters.
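
For illustration only, here is a minimal sketch of what a more thorough JSON string escaper could look like. It is not the code committed in r1469806 and not the actual ElasticSearchIndex#jsonStringEscape(); the class name JsonEscapeSketch and the main() harness are hypothetical. It assumes two things: that emitting every non-ASCII character as a backslash-u escape is acceptable, and that the request body is then encoded explicitly as UTF-8. The reported "Invalid UTF-8 start byte 0xfc" is consistent with 'ü' having been sent as a single Latin-1 byte, though the report does not confirm that.

```java
import java.nio.charset.StandardCharsets;

public class JsonEscapeSketch {

    // Hypothetical helper: escapes a Java string for use inside a JSON string literal.
    // Quotes, backslashes and control characters get their JSON escapes; anything
    // outside printable ASCII is written as a four-digit hex escape, so the
    // resulting text is pure ASCII regardless of the charset used on the wire.
    public static String jsonStringEscape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20 || c > 0x7e) {
                        // Surrogate halves are escaped one by one, which JSON permits.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Path taken from the report, with the umlaut written as a Java escape.
        String escaped = jsonStringEscape("C:\\indexdir\\L\u00fcneburg\\somefile");
        String json = "{\"uri\" : \"" + escaped + "\"}";

        // Encode explicitly as UTF-8 instead of relying on the platform default
        // charset, which on a Windows host is often a single-byte code page.
        byte[] body = json.getBytes(StandardCharsets.UTF_8);

        System.out.println(json);
        System.out.println(body.length + " bytes");
    }
}
```

Escaping to ASCII is the defensive choice here: even if some later layer picks the wrong charset, the bytes are identical in Latin-1 and UTF-8. The alternative is to leave non-ASCII characters unescaped and make sure every layer that serializes the body uses UTF-8.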
