Posted to reviews@spark.apache.org by "WeichenXu123 (via GitHub)" <gi...@apache.org> on 2023/05/04 08:43:11 UTC

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40896: [SPARK-43229][ML][PYTHON][CONNECT] Introduce Barrier Python UDF

WeichenXu123 commented on code in PR #40896:
URL: https://github.com/apache/spark/pull/40896#discussion_r1184721358


##########
python/pyspark/sql/udf.py:
##########
@@ -249,6 +259,38 @@ def __init__(
         self.evalType = evalType
         self.deterministic = deterministic
 
+        # since 3.5.0, we introduce an internal optional function attribute '_is_barrier',
+        # which is dedicated for integration with external ML training frameworks including
+        # PyTorch and XGBoost.
+        # It indicates whether this UDF will be executed in barrier mode, and is only accepted
+        # in methods 'mapInPandas' and 'mapInArrow'.
+        # For example:
+        #
+        # df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
+        #
+        # def filter_func(iterator):
+        #     for pdf in iterator:
+        #         yield pdf[pdf.id == 1]
+        #
+        # filter_func._is_barrier = True # Mark this UDF as barrier

Review Comment:
   Defining a `@barrier` decorator would look better than setting the attribute by hand. We could then document `barrier` as a developer API that we should keep stable.
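
   A minimal sketch of what such a decorator could look like, assuming it only needs to set the `_is_barrier` attribute shown in the diff above. The name `barrier` and its behaviour here are hypothetical and are not an existing PySpark API:

```python
def barrier(func):
    """Hypothetical developer API: mark a mapInPandas/mapInArrow function
    as one that should run in barrier mode.

    Assumption: the UDF machinery reads the `_is_barrier` attribute,
    as described in the diff comment above.
    """
    func._is_barrier = True
    return func


@barrier
def filter_func(iterator):
    # Same example function as in the diff: keep only rows with id == 1.
    for pdf in iterator:
        yield pdf[pdf.id == 1]


def plain_func(iterator):
    # An undecorated function carries no barrier marker.
    yield from iterator
```

   With this shape, callers would write `@barrier` instead of assigning `filter_func._is_barrier = True` themselves, and the attribute name stays an internal detail.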



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

