Posted to issues@spark.apache.org by "Kevin Conaway (JIRA)" <ji...@apache.org> on 2016/06/21 14:14:58 UTC

[jira] [Comment Edited] (SPARK-16087) Spark Hangs When Using Union With Persisted Hadoop RDD

    [ https://issues.apache.org/jira/browse/SPARK-16087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341793#comment-15341793 ] 

Kevin Conaway edited comment on SPARK-16087 at 6/21/16 2:14 PM:
----------------------------------------------------------------

{quote}
Are you actually using 1.4?
{quote}

Yes, unfortunately.  We are using a commercial Hadoop product in our cluster that only works with a specific version of HDP, which in turn has Spark 1.4.  We're hoping to alleviate this soon but are stuck with 1.4 for the time being.

That said, I'm just running this locally in my IDE.  I tried running with Spark 1.4.1 and Spark 1.6.1 and it hung with both versions.  I wonder if it's specific to the Hadoop dependency?  I'm running with {{org.apache.hadoop:hadoop-client:2.7.1}} and my local Hadoop distro is 2.7.2.
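
For what it's worth, here's a minimal sketch of how I'm double-checking which Hadoop client version actually ends up on the classpath at runtime (it only uses the standard {{org.apache.hadoop.util.VersionInfo}} API and isn't part of the reproduction itself):

{code:java}
import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
    public static void main(String[] args) {
        // Version of the Hadoop client libraries on the classpath (2.7.1 in my case)
        System.out.println("Hadoop client version: " + VersionInfo.getVersion());
    }
}
{code}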


was (Author: kevinconaway):
{quote}
Are you actually using 1.4?
{quote}

Yes, unfortunately.  We are using a commercial Hadoop product in our cluster that only works with a specific version of HDP, which in turn has Spark 1.4.  We're hoping to alleviate this soon but are stuck with 1.4 for the time being.

That said, I'm just running this locally in my IDE.  I tried running with Spark 1.4.1 and Spark 1.6.1 and it hung with both versions.  I wonder if it's specific to the Hadoop dependency?  I'm running with `org.apache.hadoop:hadoop-client:2.7.1` and my local Hadoop distro is 2.7.2.

> Spark Hangs When Using Union With Persisted Hadoop RDD
> ------------------------------------------------------
>
>                 Key: SPARK-16087
>                 URL: https://issues.apache.org/jira/browse/SPARK-16087
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.4.1, 1.6.1
>            Reporter: Kevin Conaway
>            Priority: Critical
>         Attachments: SPARK-16087.dump.log, SPARK-16087.log, part-00000, part-00001, spark-16087.tar.gz
>
>
> Spark hangs when materializing a persisted RDD that was built from a Hadoop sequence file and then union-ed with a similar RDD.
> Below is a small file that exhibits the issue:
> {code:java}
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaPairRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.api.java.function.PairFunction;
> import org.apache.spark.serializer.KryoSerializer;
> import org.apache.spark.storage.StorageLevel;
> import scala.Tuple2;
> public class SparkBug {
>     public static void main(String [] args) throws Exception {
>         JavaSparkContext sc = new JavaSparkContext(
>             new SparkConf()
>                 .set("spark.serializer", KryoSerializer.class.getName())
>                 .set("spark.master", "local[*]")
>                 .setAppName(SparkBug.class.getName())
>         );
>         JavaPairRDD<LongWritable, BytesWritable> rdd1 = sc.sequenceFile(
>             "hdfs://localhost:9000/part-00000",
>             LongWritable.class,
>             BytesWritable.class
>         ).mapToPair(new PairFunction<Tuple2<LongWritable, BytesWritable>, LongWritable, BytesWritable>() {
>             @Override
>             public Tuple2<LongWritable, BytesWritable> call(Tuple2<LongWritable, BytesWritable> tuple) throws Exception {
>                 return new Tuple2<>(
>                     new LongWritable(tuple._1.get()),
>                     new BytesWritable(tuple._2.copyBytes())
>                 );
>             }
>         }).persist(
>             StorageLevel.MEMORY_ONLY()
>         );
>         System.out.println("Before union: " + rdd1.count());
>         JavaPairRDD<LongWritable, BytesWritable> rdd2 = sc.sequenceFile(
>             "hdfs://localhost:9000/part-00001",
>             LongWritable.class,
>             BytesWritable.class
>         );
>         JavaPairRDD<LongWritable, BytesWritable> joined = rdd1.union(rdd2);
>         System.out.println("After union: " + joined.count());
>     }
> }
> {code}
> You'll need to upload the attached part-00000 and part-00001 to a local HDFS instance (I'm just using a dummy [Single Node Cluster|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html] locally).
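> In case it's useful, here is a minimal sketch of copying the attachments into HDFS programmatically via the Hadoop {{FileSystem}} API instead of using the hdfs CLI (the local source paths are placeholders for wherever you saved the attachments):
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> public class UploadParts {
>     public static void main(String [] args) throws Exception {
>         // Connect to the same single-node HDFS instance used in the reproduction
>         FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
>         // Local source paths are placeholders; adjust to where the attachments were downloaded
>         fs.copyFromLocalFile(new Path("/tmp/part-00000"), new Path("/part-00000"));
>         fs.copyFromLocalFile(new Path("/tmp/part-00001"), new Path("/part-00001"));
>         fs.close();
>     }
> }
> {code}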
> Some things to note:
> - It does not hang if rdd1 is not persisted
> - It does not hang if rdd1 is not materialized (via calling rdd1.count()) before the union-ed RDD is materialized (see the sketch after this list)
> - It does not hang if the mapToPair() transformation is removed.
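> To illustrate the second note above, a minimal sketch of the variation that does not hang for me: the same rdd1 and rdd2 definitions as in the reproduction, but with the eager rdd1.count() removed so that the union-ed RDD is the first thing materialized:
> {code:java}
> // rdd1 and rdd2 are built exactly as in the reproduction above;
> // the only change is that rdd1.count() is NOT called before the union
> JavaPairRDD<LongWritable, BytesWritable> joined = rdd1.union(rdd2);
> // Materializing the union-ed RDD first completes normally
> System.out.println("After union: " + joined.count());
> {code}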



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org