You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/09/06 05:05:04 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

zhengruifeng opened a new pull request, #42828:
URL: https://github.com/apache/spark/pull/42828

   ### What changes were proposed in this pull request?
   
   - Make `getitem` work with duplicated columns
   - Disallow bool type index
   - Disallow negative index
   
   
   ### Why are the changes needed?
   1, SQL feature OrderBy ordinal works with duplicated columns
   ```
   In [4]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [5]: df.createOrReplaceTempView("v")
   
   In [6]: spark.sql("SELECT * FROM v ORDER BY 1, 2").show()
   +---+---+---+
   |  a|  a|  a|
   +---+---+---+
   |  1|1.1|  a|
   |  2|2.2|  b|
   |  4|4.4|  c|
   +---+---+---+
   ```
   
   To support it in DataFame APIs, we need to make `getitem` work with duplicated columns
   
   
   2 & 3: the support should be unintentional
   
   
   ### Does this PR introduce _any_ user-facing change?
   1, Make `getitem` work with duplicated columns
   ```
   In [6]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [7]: df.orderBy(1, 2).show()
   ---------------------------------------------------------------------------
   AnalysisException                         Traceback (most recent call last)
   Cell In[7], line 1
   ----> 1 df.orderBy(1, 2).show()
   
   ...
   
   AnalysisException: [AMBIGUOUS_REFERENCE] Reference `a` is ambiguous, could be: [`TAB`.`a`, `TAB`.`a`, `TAB`.`a`].
   
   ```
   
   after
   
   ```
   In [1]: df = spark.sql("SELECT * FROM VALUES (1, 1.1, 'a'), (2, 2.2, 'b'), (4, 4.4, 'c') AS TAB(a, a, a)")
   
   In [2]: df[0]
   Out[2]: Column<'a'>
   
   In [3]: df[1]
   Out[3]: Column<'a'>
   
   In [4]: df.orderBy(1, 2).show()
   +---+---+---+
   |  a|  a|  a|
   +---+---+---+
   |  1|1.1|  a|
   |  2|2.2|  b|
   |  4|4.4|  c|
   +---+---+---+
   ```
   
   
   2, Disallow bool type index
   before
   ```
   In [1]: df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"],)
   
   In [2]: df[False]
   Out[2]: Column<'age'>
   
   In [3]: df[True]
   Out[3]: Column<'name'>
   ```
   
   after
   ```
   In [2]: df[True]
   ---------------------------------------------------------------------------
   PySparkTypeError                          Traceback (most recent call last)
   ...
   PySparkTypeError: [NOT_COLUMN_OR_FLOAT_OR_INT_OR_LIST_OR_STR] Argument `item` should be a column, float, integer, list or string, got bool.
   ```
   
   3, Disallow negative index
   before
   ```
   In [1]: df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"],)
   
   In [4]: df[-1]
   Out[4]: Column<'name'>
   
   In [5]: df[-2]
   Out[5]: Column<'age'>
   ```
   
   after
   ```
   In [3]: df[-1]
   ---------------------------------------------------------------------------
   IndexError                                Traceback (most recent call last)
   ...
   IndexError: Column index must be non-negative but got -1
   ```
   
   
   
   ### How was this patch tested?
   added UTs
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1317989896


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3455,8 +3455,14 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum
             return self.filter(item)
         elif isinstance(item, (list, tuple)):
             return self.select(*item)
-        elif isinstance(item, int):
-            jc = self._jdf.apply(self.columns[item])
+        elif isinstance(item, int) and not isinstance(item, bool):

Review Comment:
   ok, let me revert this part and add test cases for `bool`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1317985272


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3455,8 +3455,14 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum
             return self.filter(item)
         elif isinstance(item, (list, tuple)):
             return self.select(*item)
-        elif isinstance(item, int):
-            jc = self._jdf.apply(self.columns[item])
+        elif isinstance(item, int) and not isinstance(item, bool):

Review Comment:
   Allowing `bool` is fine since `bool` inherits `int` in Python.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns
URL: https://github.com/apache/spark/pull/42828


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1345375669


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   ```
   scala> val df = Seq((2, 1), (1, 2)).toDF("a", "b")
   val df: org.apache.spark.sql.DataFrame = [a: int, b: int]
   
   scala> df.show()
   +---+---+
   |  a|  b|
   +---+---+
   |  2|  1|
   |  1|  2|
   +---+---+
   
   
   scala> df.orderBy(lit(1)).show()
   +---+---+
   |  a|  b|
   +---+---+
   |  1|  2|
   |  2|  1|
   +---+---+
   
   
   scala> df.groupBy(lit(1)).agg(first(col("a")), max(col("b"))).show()
   +---+--------+------+
   |  1|first(a)|max(b)|
   +---+--------+------+
   |  1|       2|     2|
   +---+--------+------+
   ```
   
   it seems `orderBy(lit(1))` directly works, while `groupBy(lit(1))` needs some investigation.
   
   Let me revert this PR first



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1338272827


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   https://github.com/apache/spark/commit/6183b5e2caedd074073d0f6cb6609a634e2f5194
   
   `df[index]` has been supported since spark 2.0.0.
   
   To support `df.groupBy(1, 2, 3)` and `df.orderBy(1, 2, 3)`, right now `GetColumnByOrdinal` is only used in the direct child internally.
   
   > The SQL parser also parses ORDER BY 1, 2 as ordering by integer literal 1 and 2, and analyzer will properly resolve it.
   
   Do you mean use should directly `SortOrder(UnresolvedOrdinal(index))` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1338149416


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   If `df[index]` is already in pyspark for a while, I think it's fine to treat it as a shortcut of `df.i_th_col`. We shouldn't use `GetColumnByOrdinal` in this case, as it was added for Dataset Tuple encoding and it's guaranteed that we want to get the column from the direct child of the current plan node. But here, we can't guarantee this, as people can do `df1.select..filter...groupBy...select(df1[index])`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1338141627


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   This is really a bad API. `df.col` can be ambiguous as people may use the column reference far away from the dataframe, e.g. `df1.join(df2).select...filter...select(df1.col)`. We recommend users use qualified unresolved column instead, like `col("t1.col")`. Now `df[index]` is even worse as it only makes sense to use it intermediately in current df's transformation.
   
   Why do we add such an API? To support order by ordinal, we can just order by integer literals. The SQL parser also parses `ORDER BY 1, 2` as ordering by integer literal 1 and 2, and analyzer will properly resolve it.
   
   cc @HyukjinKwon @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1317985461


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3455,8 +3455,14 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum
             return self.filter(item)
         elif isinstance(item, (list, tuple)):
             return self.select(*item)
-        elif isinstance(item, int):
-            jc = self._jdf.apply(self.columns[item])
+        elif isinstance(item, int) and not isinstance(item, bool):
+            if item < 0:
+                raise IndexError(f"Column index must be non-negative but got {item}")

Review Comment:
   and I think we can allow negative ones too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1317990021


##########
python/pyspark/sql/dataframe.py:
##########
@@ -3455,8 +3455,14 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum
             return self.filter(item)
         elif isinstance(item, (list, tuple)):
             return self.select(*item)
-        elif isinstance(item, int):
-            jc = self._jdf.apply(self.columns[item])
+        elif isinstance(item, int) and not isinstance(item, bool):
+            if item < 0:
+                raise IndexError(f"Column index must be non-negative but got {item}")

Review Comment:
   ok, let me revert this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1320894428


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -26,7 +26,6 @@
 import io
 from contextlib import redirect_stdout
 
-from pyspark import StorageLevel

Review Comment:
   it is not related, it is a not-used import. since we are touching this file, what about also removing it btw?



##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -77,7 +121,7 @@ def test_duplicated_column_names(self):
         self.assertEqual(2, row[1])
         self.assertEqual("Row(c=1, c=2)", str(row))
         # Cannot access columns
-        self.assertRaises(AnalysisException, lambda: df.select(df[0]).first())
+        # self.assertRaises(AnalysisException, lambda: df.select(df[0]).first())

Review Comment:
   nice, will remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42828:
URL: https://github.com/apache/spark/pull/42828#issuecomment-1719189038

   merged to master, thanks @dongjoon-hyun and @HyukjinKwon for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1338278235


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   have offline discussion with wenchen, will fix it by switching to `SortOrder(Literal(index))`. Will fix it next week.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1320283304


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -77,7 +121,7 @@ def test_duplicated_column_names(self):
         self.assertEqual(2, row[1])
         self.assertEqual("Row(c=1, c=2)", str(row))
         # Cannot access columns
-        self.assertRaises(AnalysisException, lambda: df.select(df[0]).first())
+        # self.assertRaises(AnalysisException, lambda: df.select(df[0]).first())

Review Comment:
   Why don't we delete this? Is this comment required still?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1320282854


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -26,7 +26,6 @@
 import io
 from contextlib import redirect_stdout
 
-from pyspark import StorageLevel

Review Comment:
   Just a question. Is this a relevant change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1318133426


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -1397,6 +1399,16 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging {
     expr
   }
 
+  private def transformGetColumnByOrdinal(
+      attr: proto.Expression.GetColumnByOrdinal): GetColumnByOrdinal = {
+    // always set dataType field null, since it is not used in Analyzer
+    val expr = GetColumnByOrdinal(attr.getOrdinal, null)

Review Comment:
   also cc @cloud-fan for the usage of `GetColumnByOrdinal`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #42828: [SPARK-45088][PYTHON][CONNECT] Make `getitem` work with duplicated columns

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #42828:
URL: https://github.com/apache/spark/pull/42828#discussion_r1338141627


##########
python/pyspark/sql/tests/test_dataframe.py:
##########
@@ -63,6 +62,51 @@
 
 
 class DataFrameTestsMixin:
+    def test_getitem_invalid_indices(self):
+        df = self.spark.sql(
+            "SELECT * FROM VALUES "
+            "(1, 1.1, 'a'), "
+            "(2, 2.2, 'b'), "
+            "(4, 4.4, 'c') "
+            "AS TAB(a, b, c)"
+        )
+
+        # accepted type and values
+        for index in [False, True, 0, 1, 2, -1, -2, -3]:
+            df[index]

Review Comment:
   This is really a bad API. `df.col` can be ambiguous as people may use the column reference far away from the dataframe, e.g. `df1.join(df2).select...filter...select(df1.col)`. We recommend users use qualified unresolved column instead, like `col("t1.col")`. Now `df[index]` is even worse as it only makes sense to use it immediately in current df's transformation.
   
   Why do we add such an API? To support order by ordinal, we can just order by integer literals. The SQL parser also parses `ORDER BY 1, 2` as ordering by integer literal 1 and 2, and analyzer will properly resolve it.
   
   cc @HyukjinKwon @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org