Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2024/01/11 10:36:48 UTC

[PR] [SPARK-46677][SQL][CONNECT] Fix `df.col("*")` resolution [spark]

zhengruifeng opened a new pull request, #44689:
URL: https://github.com/apache/spark/pull/44689

   ### What changes were proposed in this pull request?
   On Spark Connect, `df.col("*")` should be resolved against the target plan.
   
   ### Why are the changes needed?
   ```
   In [6]: df1 = spark.createDataFrame([{"id": 1}])
   
   In [7]: df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
   
   In [8]: df1.join(df2)
   Out[8]: DataFrame[id: bigint, id: bigint, val: string]
   
   In [9]: df1.join(df2).select(df1["*"])
   Out[9]: DataFrame[id: bigint, id: bigint, val: string]
   ```
   
   It should be:
   ```
   In [3]: df1.join(df2).select(df1["*"])
   Out[3]: DataFrame[id: bigint]
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   ### How was this patch tested?
   Added unit tests.
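
   Roughly, the new tests check behavior like the following (a minimal sketch based on the example above, not the exact test code; the Connect URL is a placeholder):
   ```
   # Minimal sketch of what the added tests assert (hypothetical, illustrative only).
   from pyspark.sql import SparkSession

   # Placeholder Connect endpoint; the real tests run inside the CI test harness.
   spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

   df1 = spark.createDataFrame([{"id": 1}])
   df2 = spark.createDataFrame([{"id": 1, "val": "v"}])

   # df1["*"] must expand to df1's columns only, even after a join.
   assert df1.join(df2).select(df1["*"]).columns == ["id"]
   # A plain "*" still expands to all columns of the joined plan.
   assert df1.join(df2).select("*").columns == ["id", "id", "val"]
   ```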
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no




Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1449765594


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {
+  override def expand(input: LogicalPlan, resolver: Resolver): Seq[NamedExpression] = {
+    val resolved = resolveDFStarRecursively(planId, input)
+    resolved.map(_.expand(input, resolver)).getOrElse(
+      throw QueryCompilationErrors.cannotResolveStar(this)
+    )
+  }
+
+  private def resolveDFStarRecursively(
+    id: Long,
+    p: LogicalPlan): Option[ResolvedStar] = {
+    val resolved = if (p.getTagValue(LogicalPlan.PLAN_ID_TAG).contains(id)) {
+      Some(ResolvedStar(p.output))
+    } else {
+      p.children.iterator.map(resolveDFStarRecursively(id, _))
+        .foldLeft(Option.empty[ResolvedStar]) {

Review Comment:
   Yeah, let me make it fail in Spark Connect anyway.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1449707794


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {

Review Comment:
   Then we can handle it in `ColumnResolutionHelper` and reuse code.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1451088723


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -69,6 +69,26 @@ def test_range(self):
         self.assertEqual(self.spark.range(-2).count(), 0)
         self.assertEqual(self.spark.range(3).count(), 3)
 
+    def test_dataframe_star(self):

Review Comment:
   CI runs this test in both Connect and vanilla Spark.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1448649195


##########
python/pyspark/sql/connect/functions/builtin.py:
##########
@@ -76,15 +76,6 @@
     from pyspark.sql.connect.udtf import UserDefinedTableFunction
 
 
-def _to_col_with_plan_id(col: str, plan_id: Optional[int]) -> Column:

Review Comment:
   Deleted this helper function because of the behavior difference between `Dataset#col` and `functions#col`:
   
   https://github.com/apache/spark/blob/d2f572428be5346dfa412f6588e72686429ddc71/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1452-L1461
   
   https://github.com/apache/spark/blob/0a791993be7b6f4b843887403460ef9aebe3daf9/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L154-L162
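
   For context, one user-visible aspect of that difference, in the spirit of the PR description (a hedged sketch; assumes an existing SparkSession named `spark`):
   ```
   # Hedged illustration: a DataFrame-scoped star vs. functions.col("*").
   from pyspark.sql import functions as F

   df1 = spark.createDataFrame([{"id": 1}])
   df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
   joined = df1.join(df2)

   joined.select(df1["*"]).columns    # ['id']              -> scoped to df1's plan
   joined.select(F.col("*")).columns  # ['id', 'id', 'val'] -> all columns of the join
   ```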
   





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1449707533


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {
+  override def expand(input: LogicalPlan, resolver: Resolver): Seq[NamedExpression] = {
+    val resolved = resolveDFStarRecursively(planId, input)
+    resolved.map(_.expand(input, resolver)).getOrElse(
+      throw QueryCompilationErrors.cannotResolveStar(this)
+    )
+  }
+
+  private def resolveDFStarRecursively(
+    id: Long,
+    p: LogicalPlan): Option[ResolvedStar] = {
+    val resolved = if (p.getTagValue(LogicalPlan.PLAN_ID_TAG).contains(id)) {
+      Some(ResolvedStar(p.output))
+    } else {
+      p.children.iterator.map(resolveDFStarRecursively(id, _))
+        .foldLeft(Option.empty[ResolvedStar]) {

Review Comment:
   It's probably a bug in vanilla Spark...







Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1449748636


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {

Review Comment:
   got it





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1449706559


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {

Review Comment:
   This does not need to be a star. It's just a placeholder and will be replaced by `ResolvedStar` after finding the matching plan.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1452009946


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1719,14 +1724,31 @@ def __getitem__(self, item: Union[Column, List, Tuple]) -> "DataFrame":
 
     def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Column, "DataFrame"]:
         if isinstance(item, str):
-            # validate the column name
-            if not hasattr(self._session, "is_mock_session"):
-                self.select(item).isLocal()
-
-            return _to_col_with_plan_id(
-                col=item,
-                plan_id=self._plan._plan_id,
-            )
+            if item == "*":
+                return Column(
+                    UnresolvedStar(
+                        unparsed_target=None,
+                        plan_id=self._plan._plan_id,
+                    )
+                )
+            else:
+                # TODO: revisit vanilla Spark's Dataset.col

Review Comment:
   It's off by default anyway, so we can throw a proper error if it's enabled in Spark Connect.
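
   For reference, a hedged sketch of the classic-Spark behavior presumably being discussed here (assuming the off-by-default feature is the quoted-regex column names behind `spark.sql.parser.quotedRegexColumnNames`; `spark` is an existing session):
   ```
   # Hedged sketch: assumes the feature in question is quoted-regex column names.
   spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")

   df = spark.createDataFrame([{"id": 1, "val": "v"}])
   # With the config enabled, classic Spark's Dataset.col treats a backtick-quoted
   # name as a regex; the suggestion above is to raise a clear error on Connect instead.
   df.select(df["`.*id`"]).columns  # ['id'] on classic Spark
   ```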





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1452010622


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -558,6 +558,35 @@ def test_invalid_column(self):
         ):
             cdf1.select(cdf2.a).schema
 
+    def test_invalid_star(self):

Review Comment:
   What's the difference between Connect and Classic for this test?





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1451088843


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -558,6 +558,35 @@ def test_invalid_column(self):
         ):
             cdf1.select(cdf2.a).schema
 
+    def test_invalid_star(self):

Review Comment:
   CI runs this test only in Connect.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44689:
URL: https://github.com/apache/spark/pull/44689#issuecomment-1892904943

   Merged to master.




Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44689: [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution
URL: https://github.com/apache/spark/pull/44689




Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1448653299


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1719,14 +1724,31 @@ def __getitem__(self, item: Union[Column, List, Tuple]) -> "DataFrame":
 
     def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Column, "DataFrame"]:
         if isinstance(item, str):
-            # validate the column name
-            if not hasattr(self._session, "is_mock_session"):
-                self.select(item).isLocal()
-
-            return _to_col_with_plan_id(
-                col=item,
-                plan_id=self._plan._plan_id,
-            )
+            if item == "*":
+                return Column(
+                    UnresolvedStar(
+                        unparsed_target=None,
+                        plan_id=self._plan._plan_id,
+                    )
+                )
+            else:
+                # TODO: revisit vanilla Spark's Dataset.col

Review Comment:
   TODO for myself: revisit the implementation of `colRegex`.
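
   For reference, a short example of what `colRegex` does today on the classic DataFrame API (hedged; this is context for the TODO, not the Connect implementation itself, and assumes an existing session `spark`):
   ```
   # colRegex selects columns whose names match a backtick-quoted regex (classic Spark behavior).
   df = spark.createDataFrame([{"id": 1, "id2": 2, "val": "v"}])
   df.select(df.colRegex("`id.*`")).columns  # ['id', 'id2']
   ```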





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `df.col("*")` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1448645368


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala:
##########
@@ -696,6 +696,37 @@ case class ResolvedStar(expressions: Seq[NamedExpression]) extends Star with Une
   override def toString: String = expressions.mkString("ResolvedStar(", ", ", ")")
 }
 
+/**
+ * Represents all input attributes to a given relational operator.
+ * This is used in Spark Connect dataframe, for example:
+ *    df1 = spark.createDataFrame([{"id": 1}])
+ *    df2 = spark.createDataFrame([{"id": 1, "val": "v"}])
+ *    df1.join(df2, "id").select(df1["*"])
+ * @param planId the plan id of target node.
+ */
+case class UnresolvedDataFrameStar(planId: Long) extends Star with Unevaluable {
+  override def expand(input: LogicalPlan, resolver: Resolver): Seq[NamedExpression] = {
+    val resolved = resolveDFStarRecursively(planId, input)
+    resolved.map(_.expand(input, resolver)).getOrElse(
+      throw QueryCompilationErrors.cannotResolveStar(this)
+    )
+  }
+
+  private def resolveDFStarRecursively(
+    id: Long,
+    p: LogicalPlan): Option[ResolvedStar] = {
+    val resolved = if (p.getTagValue(LogicalPlan.PLAN_ID_TAG).contains(id)) {
+      Some(ResolvedStar(p.output))
+    } else {
+      p.children.iterator.map(resolveDFStarRecursively(id, _))
+        .foldLeft(Option.empty[ResolvedStar]) {

Review Comment:
   I don't add ambiguity detection for now, since in vanilla Spark:
   
   ```
   In [7]: df1 = spark.createDataFrame([{"id": 1}])
   
   In [8]: df1.join(df1)
   Out[8]: DataFrame[id: bigint, id: bigint]
   
   In [9]: df1.join(df1).select(df1["id"])
   ...
   AnalysisException: Column id#0L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. 
   
   In [10]: df1.join(df1).select(df1["*"])
   Out[10]: DataFrame[id: bigint]
   ```
   
   
   
   





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1452017640


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -558,6 +558,35 @@ def test_invalid_column(self):
         ):
             cdf1.select(cdf2.a).schema
 
+    def test_invalid_star(self):

Review Comment:
   ```
   In [4]: cdf1 = spark.createDataFrame([Row(a=1, b=2, c=3)])
   
   In [5]: cdf2 = spark.createDataFrame([Row(a=2, b=0)])
   
   In [6]: cdf3 = cdf1.select(cdf1.a)
   
   In [7]: cdf3.select(cdf1["*"]).schema
   ...
   AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_MISSING_FROM_INPUT] Resolved attribute(s) "b", "c" missing from "a" in operator !Project [a#0L, b#1L, c#2L].  SQLSTATE: XX000;
   !Project [a#0L, b#1L, c#2L]
   +- Project [a#0L]
      +- LogicalRDD [a#0L, b#1L, c#2L], false
   
   
   In [8]: cdf1.select(cdf2["*"]).schema
   ...
   AnalysisException: [MISSING_ATTRIBUTES.RESOLVED_ATTRIBUTE_APPEAR_IN_OPERATION] Resolved attribute(s) "a", "b" missing from "a", "b", "c" in operator !Project [a#6L, b#7L]. Attribute(s) with the same name appear in the operation: "a", "b".
   Please check if the right attribute(s) are used. SQLSTATE: XX000;
   !Project [a#6L, b#7L]
   +- LogicalRDD [a#0L, b#1L, c#2L], false
   
   
   In [9]: cdf1.join(cdf1).select(cdf1["*"]).schema
   Out[9]: StructType([StructField('a', LongType(), True), StructField('b', LongType(), True), StructField('c', LongType(), True)])
   ```
   
   `cdf1.join(cdf1).select(cdf1["*"])` won't fail with AMBIGUOUS_COLUMN_REFERENCE.





Re: [PR] [SPARK-46677][SQL][CONNECT] Fix `dataframe["*"]` resolution [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #44689:
URL: https://github.com/apache/spark/pull/44689#discussion_r1452009435


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1719,14 +1724,31 @@ def __getitem__(self, item: Union[Column, List, Tuple]) -> "DataFrame":
 
     def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Column, "DataFrame"]:
         if isinstance(item, str):
-            # validate the column name
-            if not hasattr(self._session, "is_mock_session"):
-                self.select(item).isLocal()
-
-            return _to_col_with_plan_id(
-                col=item,
-                plan_id=self._plan._plan_id,
-            )
+            if item == "*":
+                return Column(
+                    UnresolvedStar(
+                        unparsed_target=None,
+                        plan_id=self._plan._plan_id,
+                    )
+                )
+            else:
+                # TODO: revisit vanilla Spark's Dataset.col

Review Comment:
   We can probably skip it in Spark Connect. It's really a weird, non-standard feature.


