Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:23 UTC
[jira] [Updated] (SPARK-13654) get_json_object fails with java.io.CharConversionException
[ https://issues.apache.org/jira/browse/SPARK-13654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-13654:
---------------------------------
Labels: bulk-closed (was: )
> get_json_object fails with java.io.CharConversionException
> ----------------------------------------------------------
>
> Key: SPARK-13654
> URL: https://issues.apache.org/jira/browse/SPARK-13654
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Egor Pahomov
> Priority: Major
> Labels: bulk-closed
>
> I execute the following query on my data:
> {code}
> select count(distinct get_json_object(regexp_extract(line, "^\\p{ASCII}*$", 0), '$.event')) from
> (select line from logs.raw_client_log where year=2016 and month=2 and day>28 and line rlike "^\\p{ASCII}*$" and line is not null) a
> {code}
> And it fails with
> {code}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 420 in stage 168.0 failed 4 times, most recent failure: Lost task 420.3 in stage 168.0 (TID 13064, nod5-2-hadoop.anchorfree.net): java.io.CharConversionException: Invalid UTF-32 character 0x6576656e(above 10ffff) at char #47, byte #191)
> at com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
> at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
> at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
> at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
> at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
> at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:141)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
> at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:141)
> at org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2.apply(jsonExpressions.scala:138)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2202)
> at org.apache.spark.sql.catalyst.expressions.GetJsonObject.eval(jsonExpressions.scala:138)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:76)
> at org.apache.spark.sql.execution.Expand$$anonfun$doExecute$1$$anonfun$3$$anon$1.next(Expand.scala:62)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> {code}
> Basically, Spark is telling me that I have the character 敮 in my data. But the query
> {code}
> select line from logs.raw_client_log where year=2016 and month=2 and day>27 and line rlike "敮"
> {code}
> returns nothing.
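> The hex values in the stack trace suggest one possible reading (my interpretation, not confirmed anywhere in this ticket): Jackson auto-detected the input as UTF-32, so four consecutive ASCII bytes were consumed as a single code point. The bytes of the ASCII substring "even" (plausibly from the word "event") read as one big-endian UTF-32 code point give exactly 0x6576656e, the value in the error, and its low 16 bits are U+656E, the character 敮 that the message appears to blame. A minimal sketch of that arithmetic:
> {code}
> # Sketch of the suspected encoding mix-up; "even" as the offending
> # substring is an assumption, not something the ticket states.
> data = b"even"  # bytes 0x65 0x76 0x65 0x6E
>
> # Read the four ASCII bytes as one big-endian UTF-32 code point.
> code_point = int.from_bytes(data, "big")
> print(hex(code_point))           # 0x6576656e, the value Jackson reports
>
> # The result is far above the Unicode ceiling U+10FFFF, hence the
> # CharConversionException from Jackson's UTF32Reader.
> print(code_point > 0x10FFFF)     # True
>
> # The low 16 bits are U+656E, i.e. 敮, which would explain why the
> # error implicates a character that never occurs in ASCII-only data.
> print(chr(code_point & 0xFFFF))  # 敮
>
> # Python's own UTF-32 decoder rejects the same bytes for the same reason.
> try:
>     data.decode("utf-32-be")
> except UnicodeDecodeError as exc:
>     print("rejected:", exc.reason)
> {code}
> If this reading is right, the 敮 is an artifact of encoding misdetection, consistent with the rlike query above finding no such character in the actual rows.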
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)