Posted to issues@spark.apache.org by "michael davis tira (JIRA)" <ji...@apache.org> on 2017/08/08 14:04:00 UTC

[jira] [Updated] (SPARK-17684) 'null' appears in the data during aggregateByKey action.

     [ https://issues.apache.org/jira/browse/SPARK-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael davis tira updated SPARK-17684:
---------------------------------------
    Environment: 
Local environment, a VirtualBox VM running Ubuntu 16.04 with 8 GB of RAM.

EMR with 3 core nodes and 1 master (m3.xlarge) on emr-4.7.2, with Spark 1.6.2 on Hadoop 2.7.2

  was:Local environment, a VirtualBox VM running Ubuntu 16.04 with 8 GB of RAM.


> 'null' appears in the data during aggregateByKey action.
> --------------------------------------------------------
>
>                 Key: SPARK-17684
>                 URL: https://issues.apache.org/jira/browse/SPARK-17684
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Local environment, a VirtualBox VM running Ubuntu 16.04 with 8 GB of RAM.
> EMR with 3 core nodes and 1 master (m3.xlarge) on emr-4.7.2, with Spark 1.6.2 on Hadoop 2.7.2
>            Reporter: michael davis tira
>
> aggregateByKey throws an unexpected scala.MatchError.
> The MatchError is raised in the merge function (the third parameter of aggregateByKey) because some of the combiners being merged arrive as 'null':
> bq. scala.MatchError: ((4c5b5fc8-6eb9-40b1-8e6c-a81e4c9869ce07fe38fc-abf2-43b0-b618-ff898cffbad6,1.0),null) (of class scala.Tuple2)
> It is worth noting that no 'null' values are actually present in the input data.
> The problem only occurs when the RDD's partitions are large enough that Spark spills its in-memory map to disk:
> bq. INFO ExternalSorter: Thread 64 spilling in-memory map of 373.6 MB to disk (1 time so far)
> Repartitioning the RDD into a larger number of partitions avoids the problem; nevertheless, this behavior is worrying.
> I'm using the Kryo serializer and, since I suspected a serialization problem, I tried registering the HashMap class, with no apparent effect.
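> The registration attempt looked roughly like this (a minimal sketch; the serializer setting is standard, but the exact class registered here is an assumption, not the original code):
> {code:scala}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .setAppName("aggregateByKey-null-repro")
>   // Use Kryo instead of the default Java serialization.
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   // Register the HashMap class (assumed here to be scala's mutable HashMap);
>   // this had no apparent effect on the problem.
>   .registerKryoClasses(Array(classOf[scala.collection.mutable.HashMap[_, _]]))
> {code}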
> Here is a simplified example I used to reproduce the problem:
> [http://pastebin.com/rmPYb7Mu]
> and here is Spark's stack trace:
> [http://pastebin.com/gkxfYGdT]
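> In case the pastebin links expire, the job has roughly the following shape. This is a hedged reconstruction with made-up names and paths, not the original reproducer; it shows how a null combiner reaching the merge function produces a MatchError of the reported shape:
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
>
> object AggregateByKeyRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("aggregateByKey-null-repro"))
>
>     // Hypothetical input: tab-separated (key, tag, score) lines, large enough
>     // per partition that ExternalSorter spills its in-memory map to disk.
>     val pairs = sc.textFile("hdfs:///tmp/input.tsv").map { line =>
>       val cols = line.split('\t')
>       (cols(0), (cols(1), cols(2).toDouble))
>     }
>
>     val merged = pairs.aggregateByKey(("", 0.0))(
>       // seqOp: fold one (tag, score) value into the per-partition combiner.
>       (acc, v) => (v._1, acc._2 + v._2),
>       // combOp: merge two combiners. If one of them arrives as null after a
>       // spill, this match fails with a MatchError like ((tag,1.0),null).
>       (a, b) => (a, b) match {
>         case ((tagA, sumA), (_, sumB)) => (tagA, sumA + sumB)
>       }
>     )
>
>     merged.count()
>
>     // Workaround: repartitioning into more (hence smaller) partitions avoids
>     // the spill and the error, e.g. pairs.repartition(400) before aggregating.
>     sc.stop()
>   }
> }
> {code}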



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org