Posted to user@spark.apache.org by Pavel Klemenkov <pk...@gmail.com> on 2017/05/10 14:11:17 UTC
[Spark Core]: Python and Scala generate different DAGs for identical code
This Scala code:
scala> val logs = sc.textFile("big_data_specialization/log.txt").
| filter(x => !x.contains("INFO")).
| map(x => (x.split("\t")(1), 1)).
| reduceByKey((x, y) => x + y)
generated an obvious lineage:
(2) ShuffledRDD[4] at reduceByKey at <console>:27 []
+-(2) MapPartitionsRDD[3] at map at <console>:26 []
| MapPartitionsRDD[2] at filter at <console>:25 []
| big_data_specialization/log.txt MapPartitionsRDD[1] at textFile at <console>:24 []
| big_data_specialization/log.txt HadoopRDD[0] at textFile at <console>:24 []
But this Python code:
logs = sc.textFile("../log.txt")\
.filter(lambda x: 'INFO' not in x)\
.map(lambda x: (x.split('\t')[1], 1))\
.reduceByKey(lambda x, y: x + y)
generated something strange that is hard to follow:
(2) PythonRDD[13] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:422 []
| ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(2) PairwiseRDD[10] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
| PythonRDD[9] at reduceByKey at <ipython-input-9-d6a34e0335b0>:1 []
| ../log.txt MapPartitionsRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []
| ../log.txt HadoopRDD[7] at textFile at NativeMethodAccessorImpl.java:0 []
Why is that? Does PySpark do some optimization under the hood? This debug
string is practically useless for debugging.
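For reference, the computation both DAGs are supposed to express is just a
per-key count over non-INFO lines. A plain-Python sketch of the same logic (no
Spark; the sample lines here are made up for illustration):

```python
from collections import Counter

def count_non_info(lines):
    # Equivalent of filter(no "INFO") -> map((fields[1], 1)) -> reduceByKey(+)
    counts = Counter()
    for line in lines:
        if 'INFO' not in line:
            counts[line.split('\t')[1]] += 1
    return dict(counts)

sample = [
    "WARN\tdisk\tfull",
    "INFO\tstartup\tok",
    "ERROR\tdisk\tio",
]
# count_non_info(sample) -> {'disk': 2}
```

So semantically the two pipelines are identical; the question is only about
why the planned stages differ.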
--
Yours faithfully, Pavel Klemenkov.