Posted to issues@spark.apache.org by "Ryan Williams (JIRA)" <ji...@apache.org> on 2014/11/12 03:27:33 UTC

[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

    [ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207568#comment-14207568 ] 

Ryan Williams commented on SPARK-3630:
--------------------------------------

I'm seeing many Snappy {{FAILED_TO_UNCOMPRESS(5)}} and {{PARSING_ERROR(2)}} errors. I just built Spark yesterday off of [227488d|https://github.com/apache/spark/commit/227488d], so I expected that to have picked up some of the fixes detailed in this thread. I am running on a YARN cluster whose 100 nodes have kernel 2.6.32, so in a few of these attempts I used {{spark.file.transferTo=false}}, and still saw these errors.
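
(For concreteness, the flag can go on {{spark-submit}} via {{--conf spark.file.transferTo=false}} or directly on the {{SparkConf}}; a minimal sketch of the latter, with the app name as a placeholder:)

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// minimal sketch: avoid NIO transferTo when merging shuffle files,
// the suggested workaround for the kernel 2.6.32 transferTo bug
val conf = new SparkConf()
  .setAppName("my-job")                    // placeholder
  .set("spark.file.transferTo", "false")
val sc = new SparkContext(conf)
{code}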

Here are notes on some of my runs, along with the stdout I got:
* 1000 partitions, {{spark.file.transferTo=false}}: [stdout|https://www.dropbox.com/s/141keqpojucfbai/logs.1000?dl=0]. This was my latest run; it took a while to get to my reduceByKeyLocally stage, and immediately upon finishing the preceding stage it emitted ~190K {{FetchFailure}}s over ~200 attempts of the stage in about one minute, followed by some Snappy errors and the job shutting down.
* 2000 partitions, {{spark.file.transferTo=false}}: [stdout|https://www.dropbox.com/s/jr1dsldodq4rvbz/logs.2000?dl=0]. This one had ~150 FetchFailures out of the gate, seemingly ran fine for ~8mins, then had a futures timeout, seemingly ran fine for another ~17m, then got to my reduceByKeyLocally stage and died from Snappy errors.
* 2000 partitions, {{spark.file.transferTo=true}}: [stdout|https://www.dropbox.com/s/9n24ffcdq0j43ue/logs.2000.tt?dl=0]. Before running the above two, I was hoping that {{spark.file.transferTo=false}} was going to fix my problems, so I ran this to see whether having >2000 partitions was the determining factor in the Snappy errors, as [~joshrosen] suggested in this thread. No such luck! ~15 FetchFailures right away, ran fine for 24mins, got to the reduceByKeyLocally phase, Snappy-failed and died.
* these and other stdout logs can be found [here|https://www.dropbox.com/sh/pn0bik3tvy73wfi/AAByFlQVJ3QUOqiKYKXt31RGa?dl=0]

In all of these I was running on a dataset (~170GB) that should be easily handled by my cluster (5TB RAM total), and in fact I successfully ran this job against this dataset last night using a Spark 1.1 build. That job was dying of FetchFailures when I tried to run against a larger dataset (~300GB), and I thought maybe I needed the sort-based shuffle or the external shuffle service, or other 1.2.0 goodies, so I've been trying to run with 1.2.0 but can't get anything to finish.

This job reads a file in from Hadoop, coalesces to the number of partitions I've asked for, and does a {{flatMap}}, a {{reduceByKey}}, a {{map}}, and a {{reduceByKeyLocally}}. I am pretty confident that the {{Map}} I'm materializing onto the driver in the {{reduceByKeyLocally}} is a reasonable size; it's a {{Map[Long, Long]}} with about 40K entries, and I've actually successfully run this job on this data to materialize that exact map at different points this week, as I mentioned before. Something causes this job to die almost immediately upon starting the {{reduceByKeyLocally}} phase, however, usually just with Snappy errors, but with a preponderance of FetchFailures preceding them in my last attempt.
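
(For reference, a minimal sketch of the pipeline's shape, assuming {{sc}} is the {{SparkContext}} from the snippet above; the parse/re-key functions, input path, and partition count below are placeholders, not my actual code:)

{code:scala}
import org.apache.spark.SparkContext._     // pair-RDD functions (reduceByKey, etc.) on Spark 1.2

// hypothetical stand-ins for the real parsing and re-keying logic
def extractPairs(line: String): Seq[(Long, Long)] = Seq.empty   // parse a line into (key, value) pairs
def rebucket(key: Long): Long = key                             // re-key before the final local reduce

val numPartitions = 2000                    // 1000 or 2000 in the runs above

val result: scala.collection.Map[Long, Long] = sc
  .textFile("hdfs:///path/to/input")        // ~170GB input; path is a placeholder
  .coalesce(numPartitions)
  .flatMap(extractPairs)
  .reduceByKey(_ + _)
  .map { case (k, v) => (rebucket(k), v) }
  .reduceByKeyLocally(_ + _)                // builds the ~40K-entry Map on the driver
{code}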

Let me know what other information I can provide that might be useful. Thanks!

> Identify cause of Kryo+Snappy PARSING_ERROR
> -------------------------------------------
>
>                 Key: SPARK-3630
>                 URL: https://issues.apache.org/jira/browse/SPARK-3630
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Andrew Ash
>            Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
> com.esotericsoftware.kryo.io.Input.require(Input.java:169)
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325)
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624)
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) 
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) 
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master.


