Posted to reviews@spark.apache.org by "maddiedawson (via GitHub)" <gi...@apache.org> on 2023/07/11 23:16:17 UTC

[GitHub] [spark] maddiedawson commented on a diff in pull request #41946: [WIP] FunctionPickler Class

maddiedawson commented on code in PR #41946:
URL: https://github.com/apache/spark/pull/41946#discussion_r1260373715


##########
python/pyspark/ml/util.py:
##########
@@ -760,3 +762,127 @@ def _get_active_session(is_remote: bool) -> SparkSession:
     if spark is None:
         raise RuntimeError("An active SparkSession is required for the distributor.")
     return spark
+
+
+class FunctionPickler:
+    """ 
+        This class provides a way to pickle a function and its arguments.
+        It also provides a way to create a pytorch script that can run a
+        function with arguments if they have pickled to a file.
+    """
+    @staticmethod
+    def pickle_func_and_get_path(train_fn: Callable, file_path: str, save_dir: str, *args, **kwargs) -> str:

Review Comment:
   I would rename this "pickle_fn_and_save"



##########
python/pyspark/ml/util.py:
##########
@@ -760,3 +762,127 @@ def _get_active_session(is_remote: bool) -> SparkSession:
     if spark is None:
         raise RuntimeError("An active SparkSession is required for the distributor.")
     return spark
+
+
+class FunctionPickler:
+    """ 
+        This class provides a way to pickle a function and its arguments.
+        It also provides a way to create a pytorch script that can run a
+        function with arguments if they have pickled to a file.
+    """

Review Comment:
   None of these functions are specific to training, I would say. Let's reword the comments so they don't mention training.



##########
python/pyspark/ml/util.py:
##########
@@ -760,3 +762,127 @@ def _get_active_session(is_remote: bool) -> SparkSession:
     if spark is None:
         raise RuntimeError("An active SparkSession is required for the distributor.")
     return spark
+
+
+class FunctionPickler:
+    """ 
+        This class provides a way to pickle a function and its arguments.
+        It also provides a way to create a pytorch script that can run a
+        function with arguments if they have pickled to a file.
+    """
+    @staticmethod
+    def pickle_func_and_get_path(train_fn: Callable, file_path: str, save_dir: str, *args, **kwargs) -> str:
+        """
+            Given a training function and args, this function will pickle them to a file. 
+
+            Parameters
+            ----------
+            train_fn: Callable
+                The picklable function that will be pickled to a file.
+
+            file_path: str
+                The path where to save the pickled function, args, and kwargs. If its the 
+                empty string, the function will decide on a random name.
+
+            save_dir: str
+                The directory in which to save the file with the pickled function and arguments.
+                Does nothing if the path is specified. If both file_path and save_dir are empty,
+                the function will write the file to the current working directory with a random 
+                name.
+
+            *args: Any
+                Arguments of train_fn  that will be pickled.
+
+            **kwargs: Any
+                Key word arguments to train_fn that will be pickled.
+
+            Returns
+            -------
+            str:
+                The path to the file where the function and arguments are pickled.
+        """
+        if file_path != "":
+            with open(file_path, "wb") as f:
+                cloudpickle.dump((train_fn, args, kwargs), f)
+                return f.name
+
+        if save_dir == "":
+            save_dir = os.getcwd()
+
+        with tempfile.NamedTemporaryFile(dir=save_dir, delete=False) as f:
+            cloudpickle.dump((train_fn, args, kwargs), f)
+            return f.name
+
+    @staticmethod
+    def create_training_script_from_func(fn_output_save_path: str, training_script_save_path: str, pickled_fn_path: str, prefix_code:str ="", suffix_code:str = "") -> str:
+        """
+            Given a file containing a pickled function and arguments, this function will create a pytorch file
+            that will execute the function and pickle the functions outputs.
+
+            Parameters
+            ----------
+            fn_output_save_path: str
+                This is the location where the created file will save the pickled output of the function.
+
+            training_script_save_path: str
+                This is the path which will be used for the created pytorch file.
+
+            pickled_fn_path:
+                This is the path of the file containing the pickled function, args, and kwargs.
+            
+            prefix_code: str
+                This contains a string that the user can pass in which will be executed before
+                the code generated by this class to execute the function and save it. If 
+                prefix_code is the empty string, nothing will be written before the auto-
+                generated code.
+
+            suffix_code: str
+                This contains a string of code that the user can pass in which will be executed
+                after the code generated by this class finishes executing. If suffix_code is 
+                the empty string, nothing will be written after the auto-generated code.
+            
+            Returns
+            -------
+            str:
+                The path to the location of the newly created pytorch file.
+        """
+
+        code_snippet =  textwrap.dedent(
+            f"""
+                    from pyspark import cloudpickle
+                    import os
+
+                    if __name__ == "__main__":
+                        with open("{pickled_fn_path}", "rb") as f:
+                            train_fn, args, kwargs = cloudpickle.load(f)
+                        output = train_fn(*args, **kwargs)
+                        with open("{fn_output_save_path}", "wb") as f:
+                            cloudpickle.dump(output, f)
+                    """
+        )
+        with open(training_script_save_path, "w") as f:
+            if prefix_code != "":
+                f.write(prefix_code)
+            f.write(code_snippet)
+        
+        return training_script_save_path
+    
+    
+    @staticmethod
+    def get_pickled_output_from_training_func(func_output_path: str) -> Any:

Review Comment:
   Rename this to get_fn_output (output is not pickled)
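
   A minimal sketch of what the renamed helper could look like. The body is an assumption, since the diff quoted above stops at the signature, and the standard-library `pickle` stands in for `pyspark.cloudpickle` so that the example runs without Spark installed. The file on disk holds pickled bytes; the caller gets the unpickled object back.

```python
import pickle
from typing import Any


def get_fn_output(fn_output_path: str) -> Any:
    """Load the object stored at fn_output_path and return it unpickled.

    Assumption: the file was written by a generated run script using a
    pickle-compatible dump (pickle here, cloudpickle in the PR).
    """
    with open(fn_output_path, "rb") as f:
        return pickle.load(f)
```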



##########
python/pyspark/ml/util.py:
##########
@@ -760,3 +762,127 @@ def _get_active_session(is_remote: bool) -> SparkSession:
     if spark is None:
         raise RuntimeError("An active SparkSession is required for the distributor.")
     return spark
+
+
+class FunctionPickler:
+    """ 
+        This class provides a way to pickle a function and its arguments.
+        It also provides a way to create a pytorch script that can run a
+        function with arguments if they have pickled to a file.
+    """
+    @staticmethod
+    def pickle_func_and_get_path(train_fn: Callable, file_path: str, save_dir: str, *args, **kwargs) -> str:
+        """
+            Given a training function and args, this function will pickle them to a file. 
+
+            Parameters
+            ----------
+            train_fn: Callable
+                The picklable function that will be pickled to a file.
+
+            file_path: str
+                The path where to save the pickled function, args, and kwargs. If its the 
+                empty string, the function will decide on a random name.
+
+            save_dir: str
+                The directory in which to save the file with the pickled function and arguments.
+                Does nothing if the path is specified. If both file_path and save_dir are empty,
+                the function will write the file to the current working directory with a random 
+                name.
+
+            *args: Any
+                Arguments of train_fn  that will be pickled.
+
+            **kwargs: Any
+                Key word arguments to train_fn that will be pickled.
+
+            Returns
+            -------
+            str:
+                The path to the file where the function and arguments are pickled.
+        """
+        if file_path != "":
+            with open(file_path, "wb") as f:
+                cloudpickle.dump((train_fn, args, kwargs), f)
+                return f.name
+
+        if save_dir == "":
+            save_dir = os.getcwd()
+
+        with tempfile.NamedTemporaryFile(dir=save_dir, delete=False) as f:
+            cloudpickle.dump((train_fn, args, kwargs), f)
+            return f.name
+
+    @staticmethod
+    def create_training_script_from_func(fn_output_save_path: str, training_script_save_path: str, pickled_fn_path: str, prefix_code:str ="", suffix_code:str = "") -> str:

Review Comment:
   I would name this "create_fn_run_script" or something like that. I would also reorder the args to reflect their order of appearance:
   
   1) pickled_fn_path is loaded
   2) fn_output_save_path (rename this to fn_output_path) is written with output
   3) training_script_save_path (rename this to script_path) is written with the script
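
   The renaming and reordering suggested here could be sketched roughly as follows. This is not the PR's final code: the standard-library `pickle` stands in for `pyspark.cloudpickle` so the example runs without Spark, the paths are embedded with `!r` so quotes and backslashes survive, and `suffix_code` is written after the generated body, which is an assumption about the intended behavior (the quoted diff accepts `suffix_code` but does not appear to write it).

```python
import pickle
import textwrap


def create_fn_run_script(
    pickled_fn_path: str,
    fn_output_path: str,
    script_path: str,
    prefix_code: str = "",
    suffix_code: str = "",
) -> str:
    """Write a script that loads (fn, args, kwargs), runs fn, and saves its output."""
    # Arguments follow their order of use: the pickled function is read first,
    # the output file is written second, and script_path receives the script.
    body = textwrap.dedent(
        f"""
        import pickle

        if __name__ == "__main__":
            with open({pickled_fn_path!r}, "rb") as f:
                fn, args, kwargs = pickle.load(f)
            output = fn(*args, **kwargs)
            with open({fn_output_path!r}, "wb") as f:
                pickle.dump(output, f)
        """
    )
    with open(script_path, "w") as f:
        f.write(prefix_code)  # writing an empty string is a no-op
        f.write(body)
        f.write(suffix_code)  # assumption: suffix goes after the generated code
    return script_path
```

   With plain `pickle`, the function has to be importable in the child process (for example `operator.add`); cloudpickle relaxes that by serializing many functions by value.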



##########
python/pyspark/ml/util.py:
##########
@@ -760,3 +762,127 @@ def _get_active_session(is_remote: bool) -> SparkSession:
     if spark is None:
         raise RuntimeError("An active SparkSession is required for the distributor.")
     return spark
+
+
+class FunctionPickler:
+    """ 
+        This class provides a way to pickle a function and its arguments.
+        It also provides a way to create a pytorch script that can run a
+        function with arguments if they have pickled to a file.
+    """
+    @staticmethod
+    def pickle_func_and_get_path(train_fn: Callable, file_path: str, save_dir: str, *args, **kwargs) -> str:
+        """
+            Given a training function and args, this function will pickle them to a file. 
+
+            Parameters
+            ----------
+            train_fn: Callable
+                The picklable function that will be pickled to a file.
+
+            file_path: str
+                The path where to save the pickled function, args, and kwargs. If its the 
+                empty string, the function will decide on a random name.
+
+            save_dir: str
+                The directory in which to save the file with the pickled function and arguments.
+                Does nothing if the path is specified. If both file_path and save_dir are empty,
+                the function will write the file to the current working directory with a random 
+                name.

Review Comment:
   If they're optional, give them default values of "" in the function def
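
   A sketch of the signature with defaults, using the `pickle_fn_and_save` name proposed earlier in this review. The standard-library `pickle` stands in for `pyspark.cloudpickle` so the example runs without Spark; the rest mirrors the logic in the quoted diff. One caveat worth weighing: because `*args` follows the defaulted parameters, any positional argument meant for `fn` still requires `file_path` and `save_dir` to be passed positionally first.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def pickle_fn_and_save(
    fn: Callable, file_path: str = "", save_dir: str = "", *args: Any, **kwargs: Any
) -> str:
    """Pickle (fn, args, kwargs) to a file and return that file's path."""
    if file_path != "":
        with open(file_path, "wb") as f:
            pickle.dump((fn, args, kwargs), f)
        return file_path

    # No explicit path: fall back to save_dir, or the current working directory.
    if save_dir == "":
        save_dir = os.getcwd()

    # delete=False keeps the randomly named file around after the block exits.
    with tempfile.NamedTemporaryFile(dir=save_dir, delete=False) as f:
        pickle.dump((fn, args, kwargs), f)
        return f.name
```

   For example, `pickle_fn_and_save(len, "", some_dir, [1, 2, 3])` stores `len` with one positional argument under a random name in `some_dir` (a hypothetical directory), while `pickle_fn_and_save(len)` needs neither path.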



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

