Posted to issues@spark.apache.org by "michael davis tira (JIRA)" <ji...@apache.org> on 2016/09/27 08:47:20 UTC
[jira] [Created] (SPARK-17684) 'null' appears in the data during aggregateByKey action.
michael davis tira created SPARK-17684:
------------------------------------------
Summary: 'null' appears in the data during aggregateByKey action.
Key: SPARK-17684
URL: https://issues.apache.org/jira/browse/SPARK-17684
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Environment: Local environment, virtual box VM running Ubuntu 16.04 with 8 GB of ram.
Reporter: michael davis tira
aggregateByKey issues an unexpected scala.MatchError.
The MatchError is triggered in the merging function (the third parameter of aggregateByKey) because some of the records to be merged appear to be 'null':
bq. scala.MatchError: ((4c5b5fc8-6eb9-40b1-8e6c-a81e4c9869ce07fe38fc-abf2-43b0-b618-ff898cffbad6,1.0),null) (of class scala.Tuple2)
It is worth noting that no 'null' is actually present in the starting data.
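For reference, a minimal plain-Scala sketch (no Spark dependency; names are illustrative, not taken from the report) of how a combOp-style merge function, like the third argument to aggregateByKey, produces exactly this MatchError when one side of the pair arrives as null:

```scala
object MergeSketch {
  type Rec = (String, Double)

  // A merge function that pattern-matches its input pair, as a typical
  // aggregateByKey combOp would. If either element arrives as null -- as
  // the spilled records in this report apparently do -- the Tuple2 pattern
  // fails to match and Scala throws scala.MatchError, mirroring the
  // "(..., null) (of class scala.Tuple2)" error quoted above.
  def merge(pair: (Rec, Rec)): Double = pair match {
    case ((_, a), (_, b)) => a + b
  }
}
```

With valid input `MergeSketch.merge((("k", 1.0), ("k", 2.0)))` returns 3.0; with a null second element the match throws scala.MatchError, which is consistent with the stack trace linked below.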
The problem only occurs when the RDD's partitions are large enough to cause Spark to spill an in-memory map to disk:
bq. INFO ExternalSorter: Thread 64 spilling in-memory map of 373.6 MB to disk (1 time so far)
Repartitioning the RDD into a larger number of partitions avoids the problem; nevertheless, this behavior is worrying.
I'm using the Kryo serializer and, suspecting a serialization problem, I tried registering the HashMap class, with no apparent effect.
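For completeness, the Kryo registration attempted above would look roughly like this configuration fragment (a sketch; it assumes scala.collection.mutable.HashMap is the map type in play, so adjust the class to whatever is actually used):

```scala
import org.apache.spark.SparkConf

// Sketch of the Kryo setup described above; the registered class is an
// assumption, not confirmed by the report.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[scala.collection.mutable.HashMap[_, _]]))
```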
Here is a simplified example I used to reproduce the problem:
[http://pastebin.com/rmPYb7Mu]
and here is Spark's stack trace:
[http://pastebin.com/gkxfYGdT]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org