Posted to issues@iceberg.apache.org by "mjf-89 (via GitHub)" <gi...@apache.org> on 2023/02/02 08:26:43 UTC

[GitHub] [iceberg] mjf-89 commented on issue #5977: How to write to a bucket-partitioned table using PySpark?

mjf-89 commented on issue #5977:
URL: https://github.com/apache/iceberg/issues/5977#issuecomment-1413328965

   Hi, I don't have a deep understanding of PySpark internals, but I think you can write to a bucket-partitioned Iceberg table with the following approach:
   
   ```
   from pyspark.sql import functions as F

   # register Iceberg's bucket UDF for the partition transform bucket(32, string_column)
   jvm = spark.sparkContext._jvm
   jvm.org.apache.iceberg.spark.IcebergSpark.registerBucketUDF(
       spark._jsparkSession,
       "iceberg_bucket_str_32",
       jvm.org.apache.spark.sql.types.DataTypes.StringType,
       32,
   )

   # sort within each task using the registered UDF; createOrReplace() executes the write
   df.sortWithinPartitions(F.expr("iceberg_bucket_str_32(string_column)")) \
     .writeTo("iceberg_table") \
     .using("iceberg") \
     .partitionedBy(F.bucket(32, "string_column")) \
     .createOrReplace()
   ```
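   Note that `partitionedBy` only takes effect when the table is created. If you prefer to create the table up front, the same bucket spec can be declared with Spark SQL DDL; this is a minimal sketch, and the table and column names are just placeholders:
   ```
   # create the target table with the bucket(32, string_column) partition spec
   # (table and column names here are placeholders)
   spark.sql("""
       CREATE TABLE IF NOT EXISTS iceberg_table (
           string_column string,
           payload string)
       USING iceberg
       PARTITIONED BY (bucket(32, string_column))
   """)
   ```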
   Another option, as far as I understand, is to skip the sort and use the fanout writer instead:
   ```
   # the fanout writer keeps a file open per partition, so no pre-sort is required
   df.writeTo("iceberg_table") \
     .using("iceberg") \
     .option("fanout-enabled", "true") \
     .partitionedBy(F.bucket(32, "string_column")) \
     .createOrReplace()
   ```
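   A third option that I believe should work is to set the table's write distribution mode, so that Spark shuffles rows by partition value before writing and neither the manual sort nor the fanout writer is needed. A minimal sketch, assuming the table already exists (the table name is a placeholder):
   ```
   # 'write.distribution-mode' = 'hash' asks Spark to hash-distribute rows by
   # the table's partition transform before writing (table name is a placeholder)
   spark.sql("""
       ALTER TABLE iceberg_table
       SET TBLPROPERTIES ('write.distribution-mode' = 'hash')
   """)
   ```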
   
   

