Posted to issues@spark.apache.org by "Sreelal S L (JIRA)" <ji...@apache.org> on 2016/09/21 07:49:20 UTC
[jira] [Created] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()
Sreelal S L created SPARK-17621:
-----------------------------------
Summary: Accumulator value is doubled when using DataFrame.orderBy()
Key: SPARK-17621
URL: https://issues.apache.org/jira/browse/SPARK-17621
Project: Spark
Issue Type: Bug
Components: Scheduler, SQL
Affects Versions: 2.0.0
Environment: Development environment (Eclipse, single process)
Reporter: Sreelal S L
Priority: Minor
We are tracking the records read by our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value.
Below is the sample code I ran.
{code}
import org.apache.spark.sql.SparkSession

// Local session; note that sqlContext is a SparkSession despite its name.
val sqlContext = SparkSession.builder()
  .config("spark.sql.retainGroupColumns", false)
  .config("spark.sql.warehouse.dir", "file:///C:/Test")
  .master("local[*]")
  .getOrCreate()
val sc = sqlContext.sparkContext
val accumulator1 = sc.accumulator(0, "accumulator1")

// users.json holds a single row: {"name":"sreelal" ,"country":"IND"}
val usersDF = sqlContext.read.json("C:\\users.json")
// Increment the accumulator once for every record that passes through the map.
val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x })
val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
val orderedDF = counterDF.orderBy("name")
val collected = orderedDF.collect()
collected.foreach { x => println(x) }
println("accumulator1 : " + accumulator1.value)
println("Done")
{code}
I have only one row in the users.json file, so I expect accumulator1 to have the value 1, but it comes out as 2.
In the Spark SQL UI, I see two jobs generated for this single collect().
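One way to check whether the second job is what re-runs the map is to cache the mapped RDD before building the DataFrame. The sketch below is a diagnostic under stated assumptions, not a confirmed fix: it assumes the extra job is a sampling pass that orderBy runs to pick range-partition boundaries, and that it recomputes the same map stage. The object name AccumulatorCacheCheck, the accumulator name recordsRead, and the in-memory row standing in for users.json are all hypothetical. It uses sc.longAccumulator (available from Spark 2.0) instead of the older sc.accumulator.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorCacheCheck {
  // Runs orderBy + collect over a cached, counted RDD and returns the accumulator value.
  def countWithCache(): Long = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("accumulator-cache-check") // hypothetical application name
      .getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    val recordsRead = sc.longAccumulator("recordsRead")

    // In-memory stand-in for the single-row users.json from the report.
    val usersDF = Seq(("sreelal", "IND")).toDF("name", "country")

    // cache() asks Spark to keep the mapped rows once a job has computed them,
    // so a later job over the same RDD can read them back instead of re-running
    // the map. Caching is best effort: evicted blocks would be recomputed.
    val counted = usersDF.rdd.map { row => recordsRead.add(1L); row }.cache()

    val counterDF = spark.createDataFrame(counted, usersDF.schema)
    counterDF.orderBy("name").collect().foreach(println)

    val total: Long = recordsRead.value
    spark.stop()
    total
  }

  def main(args: Array[String]): Unit =
    println("recordsRead : " + countWithCache())
}
```

If the count drops back to 1 with the cache in place, that would suggest the doubling comes from the map being evaluated once per job rather than from the accumulator itself. Spark documents that accumulator updates made inside transformations can be applied more than once when a stage is re-executed, so counting inside an action (for example foreach) is the more reliable place for exact counts.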