Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/09/07 14:58:00 UTC

[GitHub] [iceberg] mrendi29 opened a new issue, #5721: Registering BucketUDF on PySpark

mrendi29 opened a new issue, #5721:
URL: https://github.com/apache/iceberg/issues/5721

   ### Query engine
   
   Apache Spark (PySpark)
   
   ### Question
   
   The docs at https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables say that if you need to register the bucket UDF, you can do so with:
   
   ```
   import org.apache.iceberg.spark.IcebergSpark
   import org.apache.spark.sql.types.DataTypes
   
   IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)
   ```
   How would we do this in PySpark? Does the method below work or is there another suggested method? 
   
   ```
   from pyspark.sql.types import  LongType
   from pyspark.sql import functions as F
   
   spark.udf.register("iceberg_bucket16", F.bucket(16), LongType())
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] TechTinkerer42 commented on issue #5721: Registering BucketUDF on PySpark

Posted by "TechTinkerer42 (via GitHub)" <gi...@apache.org>.

TechTinkerer42 commented on issue #5721:
URL: https://github.com/apache/iceberg/issues/5721#issuecomment-1694941784

   Here is an example of PySpark code that registers the bucketing UDF. @mrendi29, I hope you have figured out your issue by now, but I'm sharing this in case it helps someone else.
   
   ```
   # Register bucket UDF
   jvm_gateway = spark.sparkContext._gateway.jvm
   iceberg_spark = jvm_gateway.org.apache.iceberg.spark.IcebergSpark
   data_types = jvm_gateway.org.apache.spark.sql.types.DataTypes
   # 100 is the number of buckets
   iceberg_spark.registerBucketUDF(spark._jsparkSession, "iceberg_bucket", data_types.StringType, 100)
   ```
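
For reuse, the same calls could be wrapped in a small helper. This is only a sketch: the function name `register_iceberg_bucket_udf` and its parameters are hypothetical, it relies on private PySpark attributes (`_gateway`, `_jsparkSession`) that may change between releases, and it assumes the Iceberg Spark runtime jar is already on the session's classpath.

```python
def register_iceberg_bucket_udf(spark, udf_name, num_buckets, source_type="LongType"):
    """Register Iceberg's bucket transform as a Spark SQL function (hypothetical helper).

    `source_type` names a static field on Spark's JVM `DataTypes` class,
    for example "LongType" or "StringType".
    """
    # Reach the JVM through the Py4J gateway, as in the snippet above.
    jvm = spark.sparkContext._gateway.jvm
    iceberg_spark = jvm.org.apache.iceberg.spark.IcebergSpark
    data_types = jvm.org.apache.spark.sql.types.DataTypes

    # The JVM method expects the Java SparkSession, not the Python wrapper.
    iceberg_spark.registerBucketUDF(
        spark._jsparkSession,
        udf_name,
        getattr(data_types, source_type),
        num_buckets,
    )
    return udf_name
```

Once registered, the function should be callable from SQL expressions, for example `F.expr("iceberg_bucket16(id)")` inside `sortWithinPartitions` before a write, mirroring the Scala example in the Iceberg docs; the column name `id` here is just a placeholder.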



[GitHub] [iceberg] fireking77 commented on issue #5721: Registering BucketUDF on PySpark

Posted by "fireking77 (via GitHub)" <gi...@apache.org>.

fireking77 commented on issue #5721:
URL: https://github.com/apache/iceberg/issues/5721#issuecomment-1407757922

   I am also curious about this question!

