Posted to user@spark.apache.org by Balakumar iyer S <ba...@gmail.com> on 2019/07/22 10:57:30 UTC

Spark 2.3 DataFrame groupBy operation throws IllegalArgumentException on large dataset

Hi,

I am trying to perform a groupBy followed by a collect_set aggregation on a
two-column dataset with schema (LeftData int, RightData int).

Code snippet:

  val wind_2 = dframe.groupBy("LeftData").agg(collect_set(array("RightData")))

  wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
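
For completeness, here is a minimal self-contained version of the same
pipeline; the SparkSession setup and the assumption that the input is read as
ORC from args(0) are only for illustration:

  import org.apache.spark.sql.{SaveMode, SparkSession}
  import org.apache.spark.sql.functions.{array, collect_set}

  object GroupByCollectSet {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("GroupByCollectSet") // app name chosen only for this example
        .getOrCreate()

      // Assumed for illustration: the two-column dataset (LeftData int, RightData int)
      // is read as ORC from args(0); the real job may use a different format or path.
      val dframe = spark.read.orc(args(0))

      // Group on LeftData and collect the distinct RightData values for each key
      val wind_2 = dframe.groupBy("LeftData")
        .agg(collect_set(array("RightData")))

      wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))

      spark.stop()
    }
  }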

The above code works fine on a smaller dataset but throws the following error
on a large dataset (where each key in the LeftData column needs to be grouped
with approximately 64k values).

Could someone assist me with this? Should I set any configuration to
accommodate such large values?

ERROR
---------------------------------
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)


Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)

-- 
REGARDS
BALAKUMAR SEETHARAMAN

Re: Spark 2.3 DataFrame groupBy operation throws IllegalArgumentException on large dataset

Posted by Chris Teoh <ch...@gmail.com>.
This might be a hint. Maybe invalid data?

Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<LeftData:int,collect_set^(RightData):array<int>>'
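
If that is what is going on, the writer seems to be rejecting the generated
field name collect_set(RightData) because of the parentheses, rather than the
data itself. A possible workaround (just a sketch, reusing dframe and args(1)
from your snippet, not verified against your job) is to alias the aggregate to
a plain column name before writing to ORC:

  import org.apache.spark.sql.functions.{array, collect_set}

  // "RightDataSet" is an arbitrary alias chosen for illustration; any name
  // without special characters should give the ORC schema a plain field name.
  val wind_2 = dframe.groupBy("LeftData")
    .agg(collect_set(array("RightData")).alias("RightDataSet"))

  wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))

That keeps the struct the writer builds free of '(' and ')', which is where
the parser reports the missing ':'.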



Re: Spark 2.3 DataFrame groupBy operation throws IllegalArgumentException on large dataset

Posted by Balakumar iyer S <ba...@gmail.com>.
Hi Bobby Evans,

I apologise for the delayed response. Yes, you are right, I missed pasting the
complete stack trace exception. Herewith I have attached the complete YARN log
for the same.

Thank you. It would be helpful if you could assist me with this error.

-----------------------------------------------------------------------------------------------------------------------------------------
Regards
Balakumar Seetharaman



Re: Spark 2.3 DataFrame groupBy operation throws IllegalArgumentException on large dataset

Posted by Bobby Evans <bo...@apache.org>.
You are missing a lot of the stack trace that could explain the exception.
All it shows is that an exception happened while writing out the ORC file,
not what the underlying exception is; there should be at least one more
"Caused by" under the one you included.
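
If the rest of the chain is hard to dig out of the logs, one quick way to
surface it is to walk the cause chain yourself around the write; this is only
a sketch reusing wind_2 and args(1) from your snippet:

  try {
    wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
  } catch {
    case e: Throwable =>
      // Print every level of the exception chain, not just the outermost one
      var cause: Throwable = e
      while (cause != null) {
        println(s"Caused by: ${cause.getClass.getName}: ${cause.getMessage}")
        cause = cause.getCause
      }
      throw e
  }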

Thanks,

Bobby
