
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/23 23:04:59 UTC

[GitHub] [spark] xinrong-meng opened a new pull request, #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

xinrong-meng opened a new pull request, #38778:
URL: https://github.com/apache/spark/pull/38778

   ### What changes were proposed in this pull request?
   Implement `DataFrame.crossJoin` for Spark Connect.
   
   ### Why are the changes needed?
   Part of [SPARK-39375](https://issues.apache.org/jira/browse/SPARK-39375).
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Spark Connect users can use `DataFrame.crossJoin` as below:
   
   ```py
   >>> from pyspark.sql.connect.client import RemoteSparkSession
   >>> cspark = RemoteSparkSession()
   >>> df = cspark.range(1, 3)
   >>> df.crossJoin(df).show()
   +---+---+
   | id| id|
   +---+---+
   |  1|  1|
   |  1|  2|
   |  2|  1|
   |  2|  2|
   +---+---+
   
   >>> df.crossJoin(df)._plan.to_proto(cspark)
   root {
     join {
       left {
         range {
           start: 1
           end: 3
           step: 1
         }
       }
       right {
         range {
           start: 1
           end: 3
           step: 1
         }
       }
       join_type: JOIN_TYPE_CROSS
     }
   }
   ```
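
For readers without a Spark Connect session at hand, the semantics of the example above can be sketched in plain Python. This is only an illustration of what a cross join computes (the Cartesian product of the two inputs), not how Spark executes it; `cross_join` is a hypothetical helper and rows are modelled as tuples.

```python
from itertools import product


def cross_join(left_rows, right_rows):
    """Pair every left row with every right row (Cartesian product)."""
    return [left + right for left, right in product(left_rows, right_rows)]


# Rows modelled as 1-tuples, mirroring range(1, 3), i.e. ids 1 and 2.
ids = [(i,) for i in range(1, 3)]
result = cross_join(ids, ids)
print(result)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

The four pairs correspond to the four rows in the `show()` output, and the result size is always the product of the two input sizes.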
   
   ### How was this patch tested?
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1327801386

   In fact, there is a `def test_join_with_join_type(self)` in `test_connect_select_ops.py` which needs to be updated to also test the `cross join` type.



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1034246368


##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -58,6 +58,13 @@ def test_join_condition(self):
         )._plan.to_proto(self.connect)
         self.assertIsNotNone(plan.root.join.join_condition)
 
+    def test_crossjoin(self):

Review Comment:
   ```suggestion
       def test_crossjoin(self):
           # SPARK-41227: Test CrossJoin
   ```




[GitHub] [spark] AmplabJenkins commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1327384381

   Can one of the admins verify this patch?



[GitHub] [spark] xinrong-meng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1034146372


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, Expression, NamedExpression, UnsafeProjection}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
 import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
-import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin}
+import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin, logical}

Review Comment:
   Thanks! I reordered the imports just now.
   
   `./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect` doesn't complain locally.
   
   I'll wait to see if scalastyle in the CI jobs still fails.




[GitHub] [spark] xinrong-meng commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1325753898

   @zhengruifeng @HyukjinKwon @amaliujia @grundprinzip Thanks!



[GitHub] [spark] zhengruifeng commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1333181249

   merged into master



[GitHub] [spark] amaliujia commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1329715496

   LGTM



[GitHub] [spark] zhengruifeng closed pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

zhengruifeng closed pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join
URL: https://github.com/apache/spark/pull/38778



[GitHub] [spark] zhengruifeng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1035853820


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -722,6 +722,19 @@ def test_write_operations(self):
         ndf = self.connect.read.table("parquet_test")
         self.assertEqual(set(df.collect()), set(ndf.collect()))
 
+    def test_crossjoin(self):
+        # SPARK-41227: Test CrossJoin
+        connect_df = self.connect.read.table(self.tbl_name)
+        spark_df = self.spark.read.table(self.tbl_name)
+        self.assertEqual(
+            connect_df.join(other=connect_df, how="cross").toPandas(),
+            spark_df.join(other=spark_df, how="cross").toPandas(),
+        )
+        self.assertEqual(

Review Comment:
   ahhh, the columns are duplicated:
   ```
   In [1]: df = spark.createDataFrame([(10, 80, "Alice"), (5, None, "Bob"), (None, 10, "Tom"), (None, None, None)], schema=["age", "height",
      ...: "name"])
   
   In [2]: df.crossJoin(df).show()
   +----+------+-----+----+------+-----+
   | age|height| name| age|height| name|
   +----+------+-----+----+------+-----+
   |  10|    80|Alice|  10|    80|Alice|
   |  10|    80|Alice|   5|  null|  Bob|
   |  10|    80|Alice|null|    10|  Tom|
   ```
   
   I think you can test it like this:
   ```
           self.assertEqual(
               set(connect_df.select("id").join(other=connect_df.select("name"), how="cross").toPandas()),
               set(spark_df.select("id").join(other=spark_df.select("name"), how="cross").toPandas()),
           )
   
   ```




[GitHub] [spark] xinrong-meng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1035418421


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -722,6 +722,19 @@ def test_write_operations(self):
         ndf = self.connect.read.table("parquet_test")
         self.assertEqual(set(df.collect()), set(ndf.collect()))
 
+    def test_crossjoin(self):
+        # SPARK-41227: Test CrossJoin
+        connect_df = self.connect.read.table(self.tbl_name)
+        spark_df = self.spark.read.table(self.tbl_name)
+        self.assertEqual(
+            connect_df.join(other=connect_df, how="cross").toPandas(),
+            spark_df.join(other=spark_df, how="cross").toPandas(),
+        )
+        self.assertEqual(

Review Comment:
   Thanks! I used `collect` instead; otherwise pandas complains about duplicate column indexes.
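
As a rough illustration of why comparing collected rows sidesteps the duplicate-column problem: rows can be compared positionally as tuples, so column labels never come into play. The sketch below uses plain tuples as stand-ins for collected rows; `same_rows` and the sample data are hypothetical and not part of the PR.

```python
from collections import Counter


def same_rows(rows_a, rows_b):
    """Order-insensitive comparison that still counts duplicate rows."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


# Hypothetical rows from a self cross join; the columns would be
# (id, name, id, name), so the labels repeat but positions stay distinct.
connect_rows = [(1, "a", 1, "a"), (1, "a", 2, "b"), (2, "b", 1, "a"), (2, "b", 2, "b")]
spark_rows = list(reversed(connect_rows))  # same rows, different order

print(same_rows(connect_rows, spark_rows))
```

A `Counter` is used rather than a `set` so that a missing or extra duplicate row is still detected.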




[GitHub] [spark] grundprinzip commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

grundprinzip commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1326090172

   For fixing the scalastyle issue please run:
   
   ```
   ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect
   ```
   
   With the additional tests requested I'm good with the PR.



[GitHub] [spark] zhengruifeng commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1330002115

   let's also add an e2e test in `test_connect_basic.py`, otherwise LGTM



[GitHub] [spark] grundprinzip commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

grundprinzip commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1034119329


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, Expression, NamedExpression, UnsafeProjection}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
 import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
-import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin}
+import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin, logical}

Review Comment:
   /home/runner/work/spark/spark/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:36:43: logical should come before UsingJoin.
   
   I have yet to find a configuration where IntelliJ does not mess this up for me.




[GitHub] [spark] xinrong-meng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1034166566


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, Expression, NamedExpression, UnsafeProjection}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
 import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
-import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin}
+import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin, logical}

Review Comment:
   That's helpful! `Scalastyle checks passed. Scalafmt checks passed.` as well. Thanks.




[GitHub] [spark] xinrong-meng commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1331164803

   Force-pushed to resolve conflicts with the master branch. [622d671](https://github.com/apache/spark/pull/38778/commits/622d671ebd608df9c89a73370982ea198e74f0a9) is mainly the new change since the last review.



[GitHub] [spark] amaliujia commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement DataFrame cross join

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1035176076


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -722,6 +722,19 @@ def test_write_operations(self):
         ndf = self.connect.read.table("parquet_test")
         self.assertEqual(set(df.collect()), set(ndf.collect()))
 
+    def test_crossjoin(self):
+        # SPARK-41227: Test CrossJoin
+        connect_df = self.connect.read.table(self.tbl_name)
+        spark_df = self.spark.read.table(self.tbl_name)
+        self.assertEqual(
+            connect_df.join(other=connect_df, how="cross").toPandas(),
+            spark_df.join(other=spark_df, how="cross").toPandas(),
+        )
+        self.assertEqual(

Review Comment:
   You can use `self.assert_eq`, which is better for comparing two pandas DataFrames.




[GitHub] [spark] amaliujia commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1030950684


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -640,6 +640,8 @@ def __init__(
             join_type = proto.Join.JoinType.JOIN_TYPE_LEFT_SEMI
         elif how in ["leftanti", "anti"]:
             join_type = proto.Join.JoinType.JOIN_TYPE_LEFT_ANTI
+        elif how == "cross":
+            join_type = proto.Join.JoinType.JOIN_TYPE_CROSS

Review Comment:
   You will also need to update the error message in the `else` branch, which does not mention `cross join`:
   
   ``` 
   else:
               raise NotImplementedError(
                   """
                   Unsupported join type: %s. Supported join types include:
                   "inner", "outer", "full", "fullouter", "full_outer",
                   "leftouter", "left", "left_outer", "rightouter",
                   "right", "right_outer", "leftsemi", "left_semi",
                   "semi", "leftanti", "left_anti", "anti",
                   """
                   % how
               )
    ```
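
One way to keep such an error message from going stale is to derive it from the same table that drives the dispatch. The following is a minimal sketch under that assumption, not the code in `plan.py`: `JOIN_TYPES` and `resolve_join_type` are hypothetical names, and the string values merely stand in for `proto.Join.JoinType` members (only `JOIN_TYPE_LEFT_SEMI`, `JOIN_TYPE_LEFT_ANTI` and `JOIN_TYPE_CROSS` appear in this thread; the other member names are assumed).

```python
# Hypothetical lookup table; values stand in for proto.Join.JoinType members.
JOIN_TYPES = {
    "inner": "JOIN_TYPE_INNER",
    "outer": "JOIN_TYPE_FULL_OUTER",
    "full": "JOIN_TYPE_FULL_OUTER",
    "fullouter": "JOIN_TYPE_FULL_OUTER",
    "full_outer": "JOIN_TYPE_FULL_OUTER",
    "leftouter": "JOIN_TYPE_LEFT_OUTER",
    "left": "JOIN_TYPE_LEFT_OUTER",
    "left_outer": "JOIN_TYPE_LEFT_OUTER",
    "rightouter": "JOIN_TYPE_RIGHT_OUTER",
    "right": "JOIN_TYPE_RIGHT_OUTER",
    "right_outer": "JOIN_TYPE_RIGHT_OUTER",
    "leftsemi": "JOIN_TYPE_LEFT_SEMI",
    "left_semi": "JOIN_TYPE_LEFT_SEMI",
    "semi": "JOIN_TYPE_LEFT_SEMI",
    "leftanti": "JOIN_TYPE_LEFT_ANTI",
    "left_anti": "JOIN_TYPE_LEFT_ANTI",
    "anti": "JOIN_TYPE_LEFT_ANTI",
    "cross": "JOIN_TYPE_CROSS",
}


def resolve_join_type(how):
    """Map a user-facing join name to its join type, or fail with a full list."""
    try:
        return JOIN_TYPES[how]
    except KeyError:
        # Build the list of supported names from the table itself.
        supported = ", ".join(f'"{name}"' for name in JOIN_TYPES)
        raise NotImplementedError(
            f"Unsupported join type: {how}. Supported join types include: {supported}."
        ) from None


print(resolve_join_type("cross"))
```

With this shape, adding a new join type is a one-line change and the error message picks it up automatically.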



##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -19,9 +19,7 @@ package org.apache.spark.sql.connect.planner
 
 import scala.collection.JavaConverters._
 import scala.collection.mutable
-

Review Comment:
   Should these be reverted?




[GitHub] [spark] xinrong-meng commented on pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

xinrong-meng commented on PR #38778:
URL: https://github.com/apache/spark/pull/38778#issuecomment-1329648195

   Force-pushed to resolve conflicts with the master branch. [b940bc9](https://github.com/apache/spark/pull/38778/commits/b940bc96c6cde4935e8f692954ffe8dfc7a484dc) is mainly the new change since the last review.



[GitHub] [spark] amaliujia commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1034150994


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference, Expression, NamedExpression, UnsafeProjection}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
 import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
-import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin}
+import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, JoinType, LeftAnti, LeftOuter, LeftSemi, RightOuter, UsingJoin, logical}

Review Comment:
   Just FYI that `scalafmt` seems to not deal with imports.
   
   `./dev/lint-scala` can be run locally to tell you the scalastyle check result.




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38778:
URL: https://github.com/apache/spark/pull/38778#discussion_r1031016264


##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -58,6 +58,12 @@ def test_join_condition(self):
         )._plan.to_proto(self.connect)
         self.assertIsNotNone(plan.root.join.join_condition)
 
+    def test_crossjoin(self):
+        left_input = self.connect.readTable(table_name=self.tbl_name)
+        right_input = self.connect.readTable(table_name=self.tbl_name)
+        plan = left_input.crossJoin(other=right_input)._plan.to_proto(self.connect)
+        self.assertEqual(plan.root.join.join_type, 7)  # JOIN_TYPE_CROSS
+

Review Comment:
   After this PR, `DataFrame.join` also supports `cross`; I think we can add a test for it as well.


