Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/10/30 10:33:05 UTC
[PR] [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion [spark]
zhengruifeng opened a new pull request, #43583:
URL: https://github.com/apache/spark/pull/43583
### What changes were proposed in this pull request?
Catalog methods in the Spark Connect Python client avoid the intermediate pandas conversion.
### Why are the changes needed?
A minor optimization:
before: Arrow table -> pandas DataFrame -> scalar result
after: Arrow table -> scalar result
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing CI.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
Re: [PR] [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43583:
URL: https://github.com/apache/spark/pull/43583#issuecomment-1784975858
Merged to master.
Re: [PR] [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion [spark]
Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #43583:
URL: https://github.com/apache/spark/pull/43583#issuecomment-1784908592
cc @HyukjinKwon
Re: [PR] [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion [spark]
Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #43583:
URL: https://github.com/apache/spark/pull/43583#discussion_r1375995141
##########
python/pyspark/sql/connect/catalog.py:
##########
@@ -83,138 +85,125 @@ def setCurrentDatabase(self, dbName: str) -> None:
setCurrentDatabase.__doc__ = PySparkCatalog.setCurrentDatabase.__doc__
def listDatabases(self, pattern: Optional[str] = None) -> List[Database]:
- pdf = self._execute_and_fetch(plan.ListDatabases(pattern=pattern))
+ table = self._execute_and_fetch(plan.ListDatabases(pattern=pattern))
return [
Database(
- name=row.iloc[0],
- catalog=row.iloc[1],
- description=row.iloc[2],
- locationUri=row.iloc[3],
+ name=table[0][i].as_py(),
+ catalog=table[1][i].as_py(),
+ description=table[2][i].as_py(),
+ locationUri=table[3][i].as_py(),
)
- for _, row in pdf.iterrows()
+ for i in range(table.num_rows)
]
listDatabases.__doc__ = PySparkCatalog.listDatabases.__doc__
def getDatabase(self, dbName: str) -> Database:
- pdf = self._execute_and_fetch(plan.GetDatabase(db_name=dbName))
- assert pdf is not None
- row = pdf.iloc[0]
+ table = self._execute_and_fetch(plan.GetDatabase(db_name=dbName))
return Database(
- name=row[0],
- catalog=row[1],
- description=row[2],
- locationUri=row[3],
+ name=table[0][0].as_py(),
+ catalog=table[1][0].as_py(),
+ description=table[2][0].as_py(),
+ locationUri=table[3][0].as_py(),
)
getDatabase.__doc__ = PySparkCatalog.getDatabase.__doc__
def databaseExists(self, dbName: str) -> bool:
- pdf = self._execute_and_fetch(plan.DatabaseExists(db_name=dbName))
- assert pdf is not None
- return pdf.iloc[0].iloc[0]
+ table = self._execute_and_fetch(plan.DatabaseExists(db_name=dbName))
+ return table[0][0].as_py()
databaseExists.__doc__ = PySparkCatalog.databaseExists.__doc__
def listTables(
self, dbName: Optional[str] = None, pattern: Optional[str] = None
) -> List[Table]:
- pdf = self._execute_and_fetch(plan.ListTables(db_name=dbName, pattern=pattern))
+ table = self._execute_and_fetch(plan.ListTables(db_name=dbName, pattern=pattern))
return [
Table(
- name=row.iloc[0],
- catalog=row.iloc[1],
- # If None, returns None.
- namespace=None if row.iloc[2] is None else list(row.iloc[2]),
- description=row.iloc[3],
- tableType=row.iloc[4],
- isTemporary=row.iloc[5],
+ name=table[0][i].as_py(),
+ catalog=table[1][i].as_py(),
+ namespace=table[2][i].as_py(),
+ description=table[3][i].as_py(),
+ tableType=table[4][i].as_py(),
+ isTemporary=table[5][i].as_py(),
)
- for _, row in pdf.iterrows()
+ for i in range(table.num_rows)
]
listTables.__doc__ = PySparkCatalog.listTables.__doc__
def getTable(self, tableName: str) -> Table:
- pdf = self._execute_and_fetch(plan.GetTable(table_name=tableName))
- assert pdf is not None
- row = pdf.iloc[0]
+ table = self._execute_and_fetch(plan.GetTable(table_name=tableName))
return Table(
- name=row.iloc[0],
- catalog=row.iloc[1],
- # If None, returns None.
- namespace=None if row.iloc[2] is None else list(row.iloc[2]),
- description=row.iloc[3],
- tableType=row.iloc[4],
- isTemporary=row.iloc[5],
+ name=table[0][0].as_py(),
+ catalog=table[1][0].as_py(),
+ namespace=table[2][0].as_py(),
Review Comment:
`table[2][0]` returns a `pyarrow.StringScalar` value here.
If the underlying value is None, `table[2][0]` is `<pyarrow.StringScalar: None>` and its `as_py()` returns None.
So we no longer need to check for `None` on optional fields.
```
In [10]: import pandas as pd, pyarrow as pa
In [11]: pdf = pd.DataFrame({"A": ["x", "y", "z", None]})
In [12]: table = pa.Table.from_pandas(pdf)
In [13]: table
Out[13]:
pyarrow.Table
A: string
----
A: [["x","y","z",null]]
In [14]: table[0][3]
Out[14]: <pyarrow.StringScalar: None>
In [15]: table[0][3].as_py()
In [16]: table[0][3].as_py() is None
Out[16]: True
```
Re: [PR] [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion [spark]
Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43583: [SPARK-45723][PYTHON][CONNECT] Catalog methods avoid pandas conversion
URL: https://github.com/apache/spark/pull/43583