Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/09/21 08:13:20 UTC
[jira] [Commented] (SPARK-17621) Accumulator value is doubled when
using DataFrame.orderBy()
[ https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15509168#comment-15509168 ]
Sean Owen commented on SPARK-17621:
-----------------------------------
I think you've found the issue. You're actually evaluating usersDFwithCount twice here: once for the collect(), and I think the other evaluation has to do with creating the data frame. So the accumulator is incremented twice.
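
As a rough illustration of how to avoid the double count, here is a minimal sketch, not a definitive fix. It assumes Spark 2.x with the `longAccumulator` API, and it replaces users.json with a hypothetical in-memory row; the object and method names (`AccumulatorSketch`, `countInAction`, `countWithCache`) are made up for this example. The first method updates the accumulator inside an action, where Spark documents that each task's update is applied only once. The second keeps the update in `map` but caches the RDD, so a second job can usually reuse the computed partitions instead of re-running the `map`.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object AccumulatorSketch {
  // Hypothetical in-memory stand-in for the single-row users.json in the report.
  private val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)))

  private def sampleRows(spark: SparkSession) =
    spark.sparkContext.parallelize(Seq(Row("sreelal", "IND")))

  // Option 1: update the accumulator inside an action (foreach).
  // Updates made in actions are applied once per successful task.
  def countInAction(spark: SparkSession): Long = {
    val acc = spark.sparkContext.longAccumulator("recordsRead")
    sampleRows(spark).foreach(_ => acc.add(1L))
    acc.value
  }

  // Option 2: keep the update in a transformation, but cache the RDD so
  // later jobs can reuse the computed partitions. This is best effort only:
  // evicted partitions or retried tasks can still re-run the map.
  def countWithCache(spark: SparkSession): Long = {
    val acc = spark.sparkContext.longAccumulator("recordsReadCached")
    val counted = sampleRows(spark).map { row => acc.add(1L); row }.cache()
    val ordered = spark.createDataFrame(counted, schema).orderBy("name")
    ordered.collect().foreach(println)
    acc.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("accumulator-sketch")
      .getOrCreate()
    println("counted in action  : " + countInAction(spark))
    println("counted with cache : " + countWithCache(spark))
    spark.stop()
  }
}
```

Counting in an action costs an extra pass over the data, but it is the only variant with a documented exactly-once guarantee. The cached variant avoids the extra pass at the price of a weaker guarantee.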
> Accumulator value is doubled when using DataFrame.orderBy()
> -----------------------------------------------------------
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, SQL
> Affects Versions: 2.0.0
> Environment: Development environment (Eclipse, single process)
> Reporter: Sreelal S L
> Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value.
> Below is the sample code I ran.
> {code}
> val sqlContext = SparkSession.builder()
>   .config("spark.sql.retainGroupColumns", false)
>   .config("spark.sql.warehouse.dir", "file:///C:/Test")
>   .master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") // single row {"name":"sreelal" ,"country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema);
> val oderedDF = counterDF.orderBy("name")
> val collected = oderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done");
> {code}
> I have only one row in the users.json file. I expect accumulator1 to have the value 1, but it comes out as 2.
> In the Spark SQL UI, I see two jobs generated for the same code.