Posted to issues@spark.apache.org by "angerszhu (JIRA)" <ji...@apache.org> on 2019/08/13 14:35:00 UTC

[jira] [Comment Edited] (SPARK-28707) Spark SQL select query result size issue

    [ https://issues.apache.org/jira/browse/SPARK-28707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906265#comment-16906265 ] 

angerszhu edited comment on SPARK-28707 at 8/13/19 2:34 PM:
------------------------------------------------------------

[~S71955]

In the LIMIT case, Spark will:
 # take(n) rows from each partition,
 # gather those rows into a single partition,
 # take(n) rows from that single partition.

Step 2 does not return any data to the driver; it only shuffles data between executors.

Step 3 does return data to the driver, and that is where spark.driver.maxResultSize is checked.

So the size that is checked is not the sum of every partition's result, but the size of the data after it has been gathered into one partition (see the sketch below).
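As a rough sketch (public DataFrame/SQL API only, not Spark's internal operators; the table name cons5, the limit of 1000, and the size figures are taken from the report quoted below), the two queries hit the size check at different points:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only; cons5 and the limit of 1000 come from the repro quoted below.
val spark = SparkSession.builder().appName("SPARK-28707-sketch").getOrCreate()

// select * from cons5;
// Every partition's serialized task result goes to the driver, so the driver
// sums the sizes of all task results and compares the total against
// spark.driver.maxResultSize (7.5 KB > 5 KB in the report below, so it fails).
val allRows = spark.sql("select * from cons5").collect()

// select * from cons5 limit 1000;
// step 1: take(n) rows from each partition
// step 2: gather those rows into a single partition (executors shuffle data
//         among themselves; nothing reaches the driver, so no size check yet)
// step 3: take(n) rows from that single partition and return them to the
//         driver; only this final, already-gathered result is compared
//         against spark.driver.maxResultSize.
val limitedRows = spark.sql("select * from cons5 limit 1000").collect()
{code}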

 

 


> Spark SQL select query result size issue
> ----------------------------------------
>
>                 Key: SPARK-28707
>                 URL: https://issues.apache.org/jira/browse/SPARK-28707
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: jobit mathew
>            Priority: Major
>
> Spark SQL's select * from table; query fails the spark.driver.maxResultSize validation,
> but select * from table limit 1000; passes with the same table data.
> *Test steps*
> Set spark.driver.maxResultSize=5120 (5 KB) in spark-defaults.conf.
> 1. Create a table larger than 5 KB; in this example, a 23 KB text file named consecutive2.txt at
> local path /opt/jobit/consecutive2.txt:
> AUS,30.33,CRICKET,1000
> AUS,30.33,CRICKET,1001
> --
> AUS,30.33,CRICKET,1999
> 2. Launch spark-sql --master yarn
> 3. create table cons5(country String, avg float, sports String, year int) row format delimited fields terminated by ',';
> 4. load data local inpath '/opt/jobit/consecutive2.txt' into table cons5;
> 5. select count(*) from cons5; returns 1000.
> 6. select * from cons5 *limit 1000*; runs and displays the 1000 rows. *No error; the query executes successfully.*
> 7. select * from cons5;
> fails with the error shown below.
> *ERROR*
> select * from cons5;
> *org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 2 tasks (7.5 KB) is bigger than spark.driver.maxResultSize (5.0 KB)*
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> *In my observation, the limit query should also be validated against maxResultSize if select * is.*



