Posted to issues@spark.apache.org by "ShivaKumar SS (JIRA)" <ji...@apache.org> on 2019/07/24 19:04:00 UTC

[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

    [ https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892098#comment-16892098 ] 

ShivaKumar SS commented on SPARK-16548:
---------------------------------------

Hello guys, 

I am using Spark 2.3.0 and facing the same issue: I have to process a huge number of files, and the job fails because of one or two of them.

I use the following snippet to read the JSON files:

{code}
val df = sqlContext.read.schema(Schemas.request_01)
  .json(inputPath1, inputPath2)

df.show(10)
{code}

This is the exception I am receiving:

{code}
19/07/25 00:17:50 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 192.168.1.100, executor 0): java.io.CharConversionException: Invalid UTF-32 character 0x4d89aa(above 10ffff) at char #63, byte #255)
	at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
	at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
	at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:350)
	at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:347)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
	at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:347)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
	at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
	at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

Can anyone help me ignore these corrupt JSON files and continue processing the valid ones?
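One possible workaround (a sketch with assumed names, not an official Spark recipe): Jackson guesses each file's character encoding from its first bytes, and a stray NUL byte there makes it assume UTF-16/UTF-32, which then fails with this CharConversionException. A small helper can flag such files so they can be excluded before Spark reads them; `looksLikeSingleByteJson` is a made-up name, and the commented usage reuses `inputPath1`, `inputPath2` and `Schemas.request_01` from the snippet above.

```scala
import java.nio.file.{Files, Path, Paths}

// Heuristic, not Spark API: a file whose first four bytes contain 0x00 is
// liable to be sniffed as UTF-16/UTF-32 by Jackson's encoding detection,
// so treat it as suspicious and skip it.
def looksLikeSingleByteJson(path: Path): Boolean = {
  val in = Files.newInputStream(path)
  try {
    val head = new Array[Byte](4)
    val n = in.read(head)
    n > 0 && !head.take(n).contains(0: Byte)
  } finally in.close()
}

// Hypothetical usage with the snippet above:
// val good = Seq(inputPath1, inputPath2)
//   .filter(p => looksLikeSingleByteJson(Paths.get(p)))
// val df = sqlContext.read.schema(Schemas.request_01).json(good: _*)
```

Spark's JSON reader also accepts option("mode", "DROPMALFORMED"), though the stack trace above shows the decode error being thrown from inside FailureSafeParser, so in 2.3.0 it may not be caught as an ordinary malformed record.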

Thanks

Shiva Kumar SS

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from querying my data
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16548
>                 URL: https://issues.apache.org/jira/browse/SPARK-16548
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Egor Pahomov
>            Priority: Minor
>             Fix For: 2.2.0, 2.3.0
>
>         Attachments: corrupted.json
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 10ffff)  at char #192, byte #771)
> 	at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
> 	at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
> 	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
> 	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
> 	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
> 	at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON document among 100500, please return null for it instead of failing everything. I have a dirty one-line fix, and I understand how to make it more reasonable. What is our position: what behaviour do we want?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org