Posted to issues@spark.apache.org by "Adnan Khan (JIRA)" <ji...@apache.org> on 2015/04/02 03:07:53 UTC

[jira] [Commented] (SPARK-6659) Spark SQL 1.3 cannot read a JSON file with only one record.

    [ https://issues.apache.org/jira/browse/SPARK-6659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391898#comment-14391898 ] 

Adnan Khan commented on SPARK-6659:
-----------------------------------

{quote}
Spark SQL 1.3 cannot read a JSON file with only one record.
Here is my JSON file's content:
\{"name":"milo","age",24\}
{quote}

That's invalid JSON: there's a colon missing between "age" and 24. Because that record can't be parsed, the only column Spark SQL infers is {{_corrupt_record}}, which is why {{df.show()}} succeeds but {{df.select("name")}} fails to resolve.

I just tried it with valid JSON containing a single record, and it works. Instead of {{df: org.apache.spark.sql.DataFrame = \[_corrupt_record: string\]}} you should see
{{df: org.apache.spark.sql.DataFrame = \[age: bigint, name: string\]}}. A minimal sketch with the corrected file follows.
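
Here is a minimal Spark 1.3 shell sketch, assuming the file at /home/milo/person.json from the report is edited so the record has the colon added:

{code:scala}
// Spark 1.3 shell; sc is the SparkContext created by ./spark-shell.
// Assumes /home/milo/person.json now holds the corrected single record:
//   {"name":"milo","age":24}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// jsonFile infers the schema from the data; with valid JSON this yields
// df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
val df = sqlContext.jsonFile("/home/milo/person.json")

df.printSchema()          // root |-- age: long, |-- name: string
df.select("name").show()  // resolves, since "name" is now a real column

// With the original malformed file the only inferred column is
// _corrupt_record, so only that column can be selected:
// df.select("_corrupt_record").show()
{code}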

> Spark SQL 1.3 cannot read a JSON file with only one record.
> ------------------------------------------------------------
>
>                 Key: SPARK-6659
>                 URL: https://issues.apache.org/jira/browse/SPARK-6659
>             Project: Spark
>          Issue Type: Bug
>            Reporter: luochenghui
>
> Dear friends:
>  
> Spark SQL 1.3 cannot read a JSON file with only one record.
> Here is my JSON file's content:
> {"name":"milo","age",24}
>  
> When I run Spark SQL in local mode, it throws an exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
>  
> What I did:
> 1. ./spark-shell
> 2.
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@5f3be6c8
>  
> scala> val df = sqlContext.jsonFile("/home/milo/person.json")
> 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
> 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
> 15/03/19 22:11:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
> 15/03/19 22:11:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
> 15/03/19 22:11:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35842 (size: 22.2 KB, free: 267.2 MB)
> 15/03/19 22:11:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
> 15/03/19 22:11:45 INFO SparkContext: Created broadcast 0 from textFile at JSONRelation.scala:98
> 15/03/19 22:11:47 INFO FileInputFormat: Total input paths to process : 1
> 15/03/19 22:11:47 INFO SparkContext: Starting job: reduce at JsonRDD.scala:51
> 15/03/19 22:11:47 INFO DAGScheduler: Got job 0 (reduce at JsonRDD.scala:51) with 1 output partitions (allowLocal=false)
> 15/03/19 22:11:47 INFO DAGScheduler: Final stage: Stage 0(reduce at JsonRDD.scala:51)
> 15/03/19 22:11:47 INFO DAGScheduler: Parents of final stage: List()
> 15/03/19 22:11:47 INFO DAGScheduler: Missing parents: List()
> 15/03/19 22:11:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51), which has no missing parents
> 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=186397, maxMem=280248975
> 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 267.1 MB)
> 15/03/19 22:11:47 INFO MemoryStore: ensureFreeSpace(2251) called with curMem=189581, maxMem=280248975
> 15/03/19 22:11:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 267.1 MB)
> 15/03/19 22:11:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35842 (size: 2.2 KB, free: 267.2 MB)
> 15/03/19 22:11:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
> 15/03/19 22:11:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
> 15/03/19 22:11:48 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at JsonRDD.scala:51)
> 15/03/19 22:11:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 15/03/19 22:11:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1291 bytes)
> 15/03/19 22:11:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 15/03/19 22:11:48 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
> 15/03/19 22:11:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
> 15/03/19 22:11:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
> 15/03/19 22:11:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
> 15/03/19 22:11:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
> 15/03/19 22:11:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
> 15/03/19 22:11:49 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2023 bytes result sent to driver
> 15/03/19 22:11:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1209 ms on localhost (1/1)
> 15/03/19 22:11:49 INFO DAGScheduler: Stage 0 (reduce at JsonRDD.scala:51) finished in 1.308 s
> 15/03/19 22:11:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
> 15/03/19 22:11:49 INFO DAGScheduler: Job 0 finished: reduce at JsonRDD.scala:51, took 2.002429 s
> df: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
>  
> 3.
> scala> df.select("name").show()
> 15/03/19 22:12:44 INFO BlockManager: Removing broadcast 1
> 15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1_piece0
> 15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1_piece0 of size 2251 dropped from memory (free 280059394)
> 15/03/19 22:12:44 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:35842 in memory (size: 2.2 KB, free: 267.2 MB)
> 15/03/19 22:12:44 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
> 15/03/19 22:12:44 INFO BlockManager: Removing block broadcast_1
> 15/03/19 22:12:44 INFO MemoryStore: Block broadcast_1 of size 3184 dropped from memory (free 280062578)
> 15/03/19 22:12:45 INFO ContextCleaner: Cleaned broadcast 1
> 15/03/19 22:12:45 INFO BlockManager: Removing broadcast 0
> 15/03/19 22:12:45 INFO BlockManager: Removing block broadcast_0
> 15/03/19 22:12:45 INFO MemoryStore: Block broadcast_0 of size 163705 dropped from memory (free 280226283)
> 15/03/19 22:12:45 INFO BlockManager: Removing block broadcast_0_piece0
> 15/03/19 22:12:45 INFO MemoryStore: Block broadcast_0_piece0 of size 22692 dropped from memory (free 280248975)
> 15/03/19 22:12:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:35842 in memory (size: 22.2 KB, free: 267.3 MB)
> 15/03/19 22:12:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
> 15/03/19 22:12:45 INFO ContextCleaner: Cleaned broadcast 0
> org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns _corrupt_record;
>  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
>  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
>  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
>  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
>  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
>  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>  at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:121)
>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45)
>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
>  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
>  at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
>  at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>  at org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
>  at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465)
>  at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480)
>  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
>  at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>  at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
>  at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
>  at $iwC$$iwC$$iwC.<init>(<console>:39)
>  at $iwC$$iwC.<init>(<console>:41)
>  at $iwC.<init>(<console>:43)
>  at <init>(<console>:45)
>  at .<init>(<console>:49)
>  at .<clinit>(<console>)
>  at .<init>(<console>:7)
>  at .<clinit>(<console>)
>  at $print(<console>)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>  at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
>  at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>  at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
>  at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
>  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
>  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
>  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
>  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
>  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
>  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
>  at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
>  at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>  at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
>  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
>  at org.apache.spark.repl.Main$.main(Main.scala:31)
>  at org.apache.spark.repl.Main.main(Main.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  
> But when I invoke df.show(), it works.
> scala> df.show()
> 15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(81443) called with curMem=0, maxMem=280248975
> 15/03/19 22:13:32 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 79.5 KB, free 267.2 MB)
> 15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(31262) called with curMem=81443, maxMem=280248975
> 15/03/19 22:13:32 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 30.5 KB, free 267.2 MB)
> 15/03/19 22:13:32 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:35842 (size: 30.5 KB, free: 267.2 MB)
> 15/03/19 22:13:32 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
> 15/03/19 22:13:32 INFO SparkContext: Created broadcast 2 from textFile at JSONRelation.scala:98
> 15/03/19 22:13:32 INFO FileInputFormat: Total input paths to process : 1
> 15/03/19 22:13:32 INFO SparkContext: Starting job: runJob at SparkPlan.scala:121
> 15/03/19 22:13:32 INFO DAGScheduler: Got job 1 (runJob at SparkPlan.scala:121) with 1 output partitions (allowLocal=false)
> 15/03/19 22:13:32 INFO DAGScheduler: Final stage: Stage 1(runJob at SparkPlan.scala:121)
> 15/03/19 22:13:32 INFO DAGScheduler: Parents of final stage: List()
> 15/03/19 22:13:32 INFO DAGScheduler: Missing parents: List()
> 15/03/19 22:13:32 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[8] at map at SparkPlan.scala:96), which has no missing parents
> 15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(3968) called with curMem=112705, maxMem=280248975
> 15/03/19 22:13:32 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 267.2 MB)
> 15/03/19 22:13:32 INFO MemoryStore: ensureFreeSpace(2724) called with curMem=116673, maxMem=280248975
> 15/03/19 22:13:32 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.7 KB, free 267.2 MB)
> 15/03/19 22:13:32 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:35842 (size: 2.7 KB, free: 267.2 MB)
> 15/03/19 22:13:32 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
> 15/03/19 22:13:32 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:839
> 15/03/19 22:13:32 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[8] at map at SparkPlan.scala:96)
> 15/03/19 22:13:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
> 15/03/19 22:13:32 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1291 bytes)
> 15/03/19 22:13:32 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
> 15/03/19 22:13:32 INFO HadoopRDD: Input split: file:/home/milo/person.json:0+26
> 15/03/19 22:13:33 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1968 bytes result sent to driver
> 15/03/19 22:13:33 INFO DAGScheduler: Stage 1 (runJob at SparkPlan.scala:121) finished in 0.249 s
> 15/03/19 22:13:33 INFO DAGScheduler: Job 1 finished: runJob at SparkPlan.scala:121, took 0.381798 s
> 15/03/19 22:13:33 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 242 ms on localhost (1/1)
> 15/03/19 22:13:33 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
> _corrupt_record     
> {"name":"milo","a...
>  
> I also tested another case, with a JSON file containing more than one record, and it ran successfully.



