Posted to issues@spark.apache.org by "michael davis tira (JIRA)" <ji...@apache.org> on 2016/09/27 08:47:20 UTC
[jira] [Created] (SPARK-17684) 'null' appears in the data during aggregateByKey action.
michael davis tira created SPARK-17684:
------------------------------------------
Summary: 'null' appears in the data during aggregateByKey action.
Key: SPARK-17684
URL: https://issues.apache.org/jira/browse/SPARK-17684
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Environment: Local environment, virtual box VM running Ubuntu 16.04 with 8 GB of ram.
Reporter: michael davis tira
aggregateByKey issues an unexpected scala.MatchError.
The MatchError is triggered in the merging function (the third parameter of aggregateByKey) because some of the records to be merged appear to be 'null':
bq. scala.MatchError: ((4c5b5fc8-6eb9-40b1-8e6c-a81e4c9869ce07fe38fc-abf2-43b0-b618-ff898cffbad6,1.0),null) (of class scala.Tuple2)
It is worth noting that no 'null' is actually present in the starting data.
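For reference, a minimal plain-Scala sketch (no Spark dependency; names are illustrative, not taken from the report) of how a combOp-style merge function, like the third argument to aggregateByKey, produces exactly this MatchError when one side of the pair arrives as null:

```scala
object MergeSketch {
  type Rec = (String, Double)

  // A merge function that pattern-matches its input pair, as a typical
  // aggregateByKey combOp would. If either element arrives as null -- as
  // the spilled records in this report apparently do -- the Tuple2 pattern
  // fails to match and Scala throws scala.MatchError, mirroring the
  // "(..., null) (of class scala.Tuple2)" error quoted above.
  def merge(pair: (Rec, Rec)): Double = pair match {
    case ((_, a), (_, b)) => a + b
  }
}
```

With valid input `MergeSketch.merge((("k", 1.0), ("k", 2.0)))` returns 3.0; with a null second element the match throws scala.MatchError, which is consistent with the stack trace linked below.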
The problem only occurs when the RDD's partitions are large enough to cause Spark to spill an in-memory map to disk:
bq. INFO ExternalSorter: Thread 64 spilling in-memory map of 373.6 MB to disk (1 time so far)
Repartitioning the RDD into a larger number of partitions avoids the problem; nevertheless, this behavior is worrying.
I'm using the Kryo serializer and, suspecting a serialization problem, I tried registering the HashMap class, with no apparent effect.
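For completeness, the Kryo registration attempted above would look roughly like this configuration fragment (a sketch; it assumes scala.collection.mutable.HashMap is the map type in play, so adjust the class to whatever is actually used):

```scala
import org.apache.spark.SparkConf

// Sketch of the Kryo setup described above; the registered class is an
// assumption, not confirmed by the report.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[scala.collection.mutable.HashMap[_, _]]))
```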
Here is a simplified example I used to reproduce the problem:
[http://pastebin.com/rmPYb7Mu]
and here is Spark's stack trace:
[http://pastebin.com/gkxfYGdT]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org