Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:16:33 UTC

[jira] [Resolved] (SPARK-17684) 'null' appears in the data during aggregateByKey action.

     [ https://issues.apache.org/jira/browse/SPARK-17684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-17684.
----------------------------------
    Resolution: Incomplete

> 'null' appears in the data during aggregateByKey action.
> --------------------------------------------------------
>
>                 Key: SPARK-17684
>                 URL: https://issues.apache.org/jira/browse/SPARK-17684
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Local environment, virtual box VM running Ubuntu 16.04 with 8 GB of ram.
> EMR: 3 core nodes and 1 master (m3.xlarge) on emr-4.7.2 with Spark 1.6.2 on Hadoop 2.7.2
>            Reporter: michael davis tira
>            Priority: Major
>              Labels: bulk-closed
>
> aggregateByKey throws an unexpected scala.MatchError.
> The MatchError is triggered in the merging function (the third parameter of aggregateByKey) because some of the records to be merged appear to be 'null':
> bq. scala.MatchError: ((4c5b5fc8-6eb9-40b1-8e6c-a81e4c9869ce07fe38fc-abf2-43b0-b618-ff898cffbad6,1.0),null) (of class scala.Tuple2)
> It is worth noting that no 'null' is actually present in the input data.
> The problem only occurs when the RDD's partitions are large enough to make Spark spill an in-memory map to disk:
> bq. INFO ExternalSorter: Thread 64 spilling in-memory map of 373.6 MB to disk (1 time so far)
> Repartitioning the RDD into a larger number of partitions avoids the problem; nevertheless, this behavior is worrying.
> I'm using the Kryo serializer and, suspecting a serialization problem, I tried registering the HashMap class, with no apparent effect.
> Here is a simplified example I used to reproduce the problem:
> [http://pastebin.com/rmPYb7Mu]
> and here is Spark's stack trace:
> [http://pastebin.com/gkxfYGdT]
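
For readers without access to the pastebin links, the following is a minimal Scala sketch (no Spark dependency) of how a merge function written as a tuple pattern match can raise scala.MatchError when one of its arguments is null, which is the failure mode described above. It is not the reporter's code and does not reproduce the spill behavior. The names MatchErrorSketch, Acc, mergeUnsafe and mergeDefensive are hypothetical, and the (String, Double) accumulator shape is only an assumption based on the tuple shown in the error message.

```scala
import scala.util.Try

object MatchErrorSketch {
  // Hypothetical accumulator type: a (key, score) pair, loosely modeled on
  // the tuple that appears in the reported error message.
  type Acc = (String, Double)

  // Merge function written as a pattern match on both arguments.
  // A tuple pattern never matches null, so a null argument falls through
  // every case and scala.MatchError is thrown.
  def mergeUnsafe(a: Acc, b: Acc): Acc = (a, b) match {
    case ((k, x), (_, y)) => (k, x + y)
  }

  // Defensive variant: returns the other side when one argument is null.
  // This only hides the symptom; it does not explain where the null
  // records come from.
  def mergeDefensive(a: Acc, b: Acc): Acc = (a, b) match {
    case (null, other)    => other
    case (other, null)    => other
    case ((k, x), (_, y)) => (k, x + y)
  }

  def main(args: Array[String]): Unit = {
    // Both arguments present: the values are summed.
    println(mergeUnsafe(("a", 1.0), ("a", 2.0)))

    // One argument null: the match fails; Try captures the MatchError.
    println(Try(mergeUnsafe(("a", 1.0), null)))

    // The defensive variant returns the non-null side instead of failing.
    println(mergeDefensive(("a", 1.0), null))
  }
}
```

A guard like mergeDefensive may help narrow down which records arrive as null, but it should be treated as a diagnostic aid rather than a fix for the underlying issue.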



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
