Posted to reviews@spark.apache.org by "khalidmammadov (via GitHub)" <gi...@apache.org> on 2023/02/14 13:59:52 UTC

[GitHub] [spark] khalidmammadov opened a new pull request, #40015: [WIP][SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

khalidmammadov opened a new pull request, #40015:
URL: https://github.com/apache/spark/pull/40015

   Currently, the PySpark version of the `catalog.cacheTable` function does not support specifying a storage level. This PR adds that capability.
   
   ### What changes were proposed in this pull request?
   Add extra parameter to catalog.cacheTable
   
   
   ### Why are the changes needed?
   To allow users to specify which storage level to use when caching a table in PySpark/Connect code.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   ### How was this patch tested?
   Updated existing test cases.
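
   The intended parameter behaviour can be sketched as follows. This is a minimal, self-contained illustration, not Spark's actual implementation: the `cache_table` helper and the string storage levels are hypothetical stand-ins; only the shape of the change (an optional storage level falling back to a default) comes from the PR description and the diff under review.

   ```python
   from typing import Optional

   # Hypothetical stand-in for the proposed API: catalog.cacheTable gains an
   # optional storage-level argument. "MEMORY_AND_DISK" mirrors the default
   # named in the updated docstring under review.
   DEFAULT_STORAGE_LEVEL = "MEMORY_AND_DISK"

   def cache_table(table_name: str, storage_level: Optional[str] = None) -> str:
       # When the caller omits the argument, the default level applies,
       # preserving the behaviour of the old single-argument signature.
       level = storage_level or DEFAULT_STORAGE_LEVEL
       return f"cache {table_name} at {level}"
   ```

   `cache_table("t1")` keeps the old behaviour, while `cache_table("t1", "DISK_ONLY")` exercises the new parameter.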
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1435639551

   @HyukjinKwon can you please check if it looks OK?




[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PYSPARK][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1495403951

   Hi, thanks for letting me know. I will look into it.
   
   On Mon, 3 Apr 2023, 21:15 Takuya UESHIN, ***@***.***> wrote:
   
   > Hi @khalidmammadov <https://github.com/khalidmammadov>, now that Catalog
   > in Scala client including protobuf definition has been implemented, do you
   > want to continue working on this?
   > Otherwise, I can take this over.
   > Thanks.
   




[GitHub] [spark] khalidmammadov commented on a diff in pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on code in PR #40015:
URL: https://github.com/apache/spark/pull/40015#discussion_r1161268454


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -1830,14 +1831,24 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
 
 
 class CacheTable(LogicalPlan):
-    def __init__(self, table_name: str) -> None:
+    def __init__(self, table_name: str, storage_level: Optional[StorageLevel] = None) -> None:
         super().__init__(None)
         self._table_name = table_name
-
-    def plan(self, session: "SparkConnectClient") -> proto.Relation:
-        plan = proto.Relation(
-            catalog=proto.Catalog(cache_table=proto.CacheTable(table_name=self._table_name))
-        )
+        self._storage_level = storage_level
+
+    def plan(self, session: "SparkConnectClient") -> proto.Relation:
+        _cache_table = proto.CacheTable(table_name=self._table_name)
+        if self._storage_level:
+            _cache_table.storage_level.CopyFrom(
+                proto.StorageLevel(
+                    use_disk=self._storage_level.useDisk,
+                    use_memory=self._storage_level.useMemory,
+                    use_off_heap=self._storage_level.useOffHeap,
+                    deserialized=self._storage_level.deserialized,
+                    replication=self._storage_level.replication,
+                )

Review Comment:
   Shall we do this as a follow up, rather than here?





[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1441644663

   Closing as there has been no traction.




[GitHub] [spark] HyukjinKwon commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1442772377

   Sorry for the late response. We should have this feature parity, but I need to check on a concern about storage level that was raised internally. I will ping the relevant person to comment here.




[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1505902017

   Thanks @ueshin!




[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1501103732

   The build was failing due to https://github.com/apache/spark/pull/40674 and is now fixed by https://github.com/apache/spark/pull/40681.
   
   Rebased.




[GitHub] [spark] ueshin closed pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level
URL: https://github.com/apache/spark/pull/40015




[GitHub] [spark] ueshin commented on a diff in pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40015:
URL: https://github.com/apache/spark/pull/40015#discussion_r1160908837


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -1830,14 +1831,24 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
 
 
 class CacheTable(LogicalPlan):
-    def __init__(self, table_name: str) -> None:
+    def __init__(self, table_name: str, storage_level: Optional[StorageLevel] = None) -> None:
         super().__init__(None)
         self._table_name = table_name
-
-    def plan(self, session: "SparkConnectClient") -> proto.Relation:
-        plan = proto.Relation(
-            catalog=proto.Catalog(cache_table=proto.CacheTable(table_name=self._table_name))
-        )
+        self._storage_level = storage_level
+
+    def plan(self, session: "SparkConnectClient") -> proto.Relation:
+        _cache_table = proto.CacheTable(table_name=self._table_name)
+        if self._storage_level:
+            _cache_table.storage_level.CopyFrom(
+                proto.StorageLevel(
+                    use_disk=self._storage_level.useDisk,
+                    use_memory=self._storage_level.useMemory,
+                    use_off_heap=self._storage_level.useOffHeap,
+                    deserialized=self._storage_level.deserialized,
+                    replication=self._storage_level.replication,
+                )

Review Comment:
   Shall we extract functions to convert to/from `proto.StorageLevel` and reuse them in `client.py` as well?
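
   One way to read this suggestion is as a pair of shared conversion helpers, sketched below. The dataclasses are plain stand-ins for pyspark's `StorageLevel` and the generated `proto.StorageLevel` (both stand-ins are assumptions, not Spark's actual classes); the field names follow the diff above. Extracting the mapping once lets `plan.py` and `client.py` share a single code path instead of duplicating it field by field.

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class StorageLevel:          # stand-in for pyspark.StorageLevel
       useDisk: bool
       useMemory: bool
       useOffHeap: bool
       deserialized: bool
       replication: int = 1

   @dataclass(frozen=True)
   class ProtoStorageLevel:     # stand-in for proto.StorageLevel
       use_disk: bool
       use_memory: bool
       use_off_heap: bool
       deserialized: bool
       replication: int

   def storage_level_to_proto(level: StorageLevel) -> ProtoStorageLevel:
       # Same field-by-field mapping as in the diff above, in one place.
       return ProtoStorageLevel(
           use_disk=level.useDisk,
           use_memory=level.useMemory,
           use_off_heap=level.useOffHeap,
           deserialized=level.deserialized,
           replication=level.replication,
       )

   def proto_to_storage_level(msg: ProtoStorageLevel) -> StorageLevel:
       # Inverse direction, so the client can also decode server responses.
       return StorageLevel(
           useDisk=msg.use_disk,
           useMemory=msg.use_memory,
           useOffHeap=msg.use_off_heap,
           deserialized=msg.deserialized,
           replication=msg.replication,
       )
   ```

   The two helpers are exact inverses, so a round trip returns the original level.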





[GitHub] [spark] khalidmammadov commented on a diff in pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on code in PR #40015:
URL: https://github.com/apache/spark/pull/40015#discussion_r1125615556


##########
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##########
@@ -184,3 +184,15 @@ message DataType {
     DataType sql_type = 5;
   }
 }
+
+enum StorageLevel {

Review Comment:
   @zhengruifeng thanks for the review. 
   I have removed the enum and the static mappings. 
   Now, the storage level is resolved based on user input. This allows Python users to use the PySpark constants or customise them if needed. It is the same logic PySpark already uses to accept and resolve a storage level from the user.





[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1444167507

   Thanks @HyukjinKwon 




[GitHub] [spark] ueshin commented on a diff in pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #40015:
URL: https://github.com/apache/spark/pull/40015#discussion_r1163280656


##########
python/pyspark/sql/catalog.py:
##########
@@ -917,25 +919,34 @@ def isCached(self, tableName: str) -> bool:
         """
         return self._jcatalog.isCached(tableName)
 
-    def cacheTable(self, tableName: str) -> None:
-        """Caches the specified table in-memory.
+    def cacheTable(self, tableName: str, storageLevel: Optional[StorageLevel] = None) -> None:
+        """Caches the specified table in-memory or with given storage level.
+        Default MEMORY_AND_DISK.
 
         .. versionadded:: 2.0.0
 
         Parameters
         ----------
         tableName : str
             name of the table to get.
+        storageLevel : :class:`StorageLevel`
+            storage level to set for persistence.
 
             .. versionchanged:: 3.4.0
                 Allow ``tableName`` to be qualified with catalog name.
+            .. versionchanged:: 3.5.0
+                Allow to specify storage level.

Review Comment:
   ```suggestion
           tableName : str
               name of the table to get.
   
               .. versionchanged:: 3.4.0
                   Allow ``tableName`` to be qualified with catalog name.
   
           storageLevel : :class:`StorageLevel`
               storage level to set for persistence.
   
               .. versionchanged:: 3.5.0
                   Allow to specify storage level.
   ```





[GitHub] [spark] ueshin commented on pull request #40015: [SPARK-42437][PYSPARK][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1494921787

   Hi @khalidmammadov, now that `Catalog` in Scala client including protobuf definition has been implemented, do you want to continue working on this?
   Otherwise, I can take this over.
   Thanks.




[GitHub] [spark] khalidmammadov commented on pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1438674170

   @HyukjinKwon @amaliujia can you please review? 
   If you think the implementation of StorageLevel is not right, I will close this and can open a new one for the parameter only (once an alternative implementation is available). The problem is that you can't just add something to PySpark without touching Connect; I guess that's intentional.




[GitHub] [spark] khalidmammadov closed pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "khalidmammadov (via GitHub)" <gi...@apache.org>.
khalidmammadov closed pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level
URL: https://github.com/apache/spark/pull/40015




[GitHub] [spark] ueshin commented on pull request #40015: [SPARK-42437][PYTHON][CONNECT] PySpark catalog.cacheTable will allow to specify storage level

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1505864295

   Thanks! merging to master.




[GitHub] [spark] zhengruifeng commented on a diff in pull request #40015: [SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40015:
URL: https://github.com/apache/spark/pull/40015#discussion_r1119515771


##########
connector/connect/common/src/main/protobuf/spark/connect/types.proto:
##########
@@ -184,3 +184,15 @@ message DataType {
     DataType sql_type = 5;
   }
 }
+
+enum StorageLevel {

Review Comment:
   the problem is that the definitions are different between PySpark and SQL:
   
   PySpark: `MEMORY_AND_DISK = useDisk=True, useMemory=True, useOffHeap=False, deserialized=False`
   SQL: `MEMORY_AND_DISK = _useDisk=true, _useMemory=true, _useOffHeap=false, _deserialized=true`
   
   Also, `MEMORY_AND_DISK_DESER` was dedicated to PySpark; it is equivalent to `MEMORY_AND_DISK` on the Scala side.
   
   cc @cloud-fan @HyukjinKwon @amaliujia 
   
   Also, please follow [the guide](https://protobuf.dev/programming-guides/style/#enums) when defining an `enum`; it should look like:
   ```
   enum StorageLevel {
       STORAGE_LEVEL_UNSPECIFIED = 0;
       STORAGE_LEVEL_DISK_ONLY = 1;
      ...
   ```
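
   The definitional mismatch described above can be made concrete with a small sketch. The dataclass is a plain stand-in for the real `StorageLevel` classes (an assumption, not Spark's code); the flag values are taken directly from the comment above. The same constant name carries a different `deserialized` flag on the two sides, so a fixed cross-language enum would have to pick one meaning.

   ```python
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class StorageLevel:  # stand-in for the PySpark / Scala StorageLevel classes
       useDisk: bool
       useMemory: bool
       useOffHeap: bool
       deserialized: bool
       replication: int = 1

   # Flag values as quoted in the comment above: the constants share a name
   # but differ in the `deserialized` flag.
   PYSPARK_MEMORY_AND_DISK = StorageLevel(True, True, False, False)
   SQL_MEMORY_AND_DISK = StorageLevel(True, True, False, True)

   # MEMORY_AND_DISK_DESER was dedicated to PySpark and matches the
   # Scala-side MEMORY_AND_DISK.
   PYSPARK_MEMORY_AND_DISK_DESER = StorageLevel(True, True, False, True)
   ```

   Sending the raw flags instead of an enum, as the PR ended up doing, sidesteps this naming mismatch entirely.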





[GitHub] [spark] amaliujia commented on pull request #40015: [WIP][SPARK-42437][PySpark][Connect] PySpark catalog.cacheTable will allow to specify storage level

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on PR #40015:
URL: https://github.com/apache/spark/pull/40015#issuecomment-1431992781

   cc @zhengruifeng IIRC there were some open questions about supporting storage level in Connect?

