Posted to issues@spark.apache.org by "Emlyn Corrin (JIRA)" <ji...@apache.org> on 2018/04/23 07:40:00 UTC
[jira] [Created] (SPARK-24051) Incorrect results
Emlyn Corrin created SPARK-24051:
------------------------------------
Summary: Incorrect results
Key: SPARK-24051
URL: https://issues.apache.org/jira/browse/SPARK-24051
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Emlyn Corrin
I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) query, demonstrated by the Java program below. It was simplified from a much more complex query, but I'm having trouble simplifying it further without removing the erroneous behaviour.
{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("SparkTest")
                .setMaster("local[*]");
        SparkSession session = SparkSession.builder().config(conf).getOrCreate();

        Row[] arr1 = new Row[]{
                RowFactory.create(1, 42),
                RowFactory.create(2, 99)};
        StructType sch1 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty()),
                new StructField("b", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
        ds1.show();

        Row[] arr2 = new Row[]{
                RowFactory.create(3)};
        StructType sch2 = new StructType(new StructField[]{
                new StructField("a", DataTypes.IntegerType, true, Metadata.empty())});
        Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
                .withColumn("b", functions.lit(0));
        ds2.show();

        Column[] cols = new Column[]{
                new Column("a"),
                new Column("b").as("b", Metadata.empty()),
                functions.count(functions.lit(1))
                        .over(Window.partitionBy(new Column("a")))
                        .as("n")};

        Dataset<Row> ds = ds1
                .select(cols)
                .union(ds2.select(cols))
                .where(new Column("n").geq(1))
                .drop("n");
        ds.show();
        //ds.explain(true);
    }
}
{code}
It just calculates the union of 2 datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values (42 and 99) in rows 1 and 2.
Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.
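As a cross-check on the expected output, the intended semantics of the query can be sketched in plain Java with no Spark dependency. This is only a minimal reference model, not Spark's implementation: the class name {{ExpectedUnion}} and its helper methods are hypothetical, rows are modelled as {{int[]{a, b}}}, and null values (which the schemas above allow) are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reference model of the query; it does not use Spark at all.
class ExpectedUnion {

    // Models select(a, b, count(1) over (partition by a) as n) on one dataset:
    // every input row {a, b} becomes {a, b, n}, where n is the number of rows
    // in that dataset sharing the same value of a.
    static List<int[]> withPartitionCount(List<int[]> rows) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] r : rows) {
            counts.merge(r[0], 1, Integer::sum);
        }
        List<int[]> out = new ArrayList<>();
        for (int[] r : rows) {
            out.add(new int[]{r[0], r[1], counts.get(r[0])});
        }
        return out;
    }

    // Models union(...).where(n >= 1).drop("n"), keeping ds1 rows before ds2 rows.
    static List<int[]> expected(List<int[]> ds1, List<int[]> ds2) {
        List<int[]> unioned = new ArrayList<>(withPartitionCount(ds1));
        unioned.addAll(withPartitionCount(ds2));
        List<int[]> out = new ArrayList<>();
        for (int[] r : unioned) {
            if (r[2] >= 1) {
                out.add(new int[]{r[0], r[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> ds1 = Arrays.asList(new int[]{1, 42}, new int[]{2, 99});
        List<int[]> ds2 = Collections.singletonList(new int[]{3, 0});
        for (int[] r : expected(ds1, ds2)) {
            System.out.println(r[0] + " " + r[1]);
        }
    }
}
```

Because every row belongs to a partition of at least one row, the {{n >= 1}} filter keeps all rows, so the model returns (1, 42), (2, 99), (3, 0), matching the expected table above. The row order here is an artefact of the model; Spark does not guarantee any particular order without an explicit sort.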
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)