Posted to user@spark.apache.org by Gaurav Agarwal <ga...@gmail.com> on 2023/06/15 09:06:30 UTC

Fwd: iceberg queries

Hi Team,

Sample Merge query:

df.createOrReplaceTempView("source")

MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source) source
ON target.col1 = source.col1 -- this is my bucket column
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The source dataset is a temporary view that currently contains 1.5 million
records and may grow to 20 million rows in the future; its ids fall into 16
buckets. The target Iceberg table also has 16 buckets. Rows from the source
update the target where the id matches and are inserted where it does not.

I have 1700 columns in my table.

The Spark dataset uses default partitioning. Do we need to bucket the
Spark dataset on the bucket column as well?
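
To make the question concrete, bucketing the Spark dataset would look
roughly like the sketch below. It is only a minimal PySpark sketch under
assumptions: the helper name prepare_merge_source and its defaults are
hypothetical, with col1, 16 and "source" taken from the query above. As far
as I understand, Spark's hash partitioning is not the same function as
Iceberg's bucket transform, so this clusters rows by col1 but may not line
up one-to-one with the table's buckets.

```python
def prepare_merge_source(df, bucket_col="col1", num_buckets=16, view_name="source"):
    """Hypothetical helper: cluster the source rows on the bucket column
    before registering the temp view that MERGE INTO reads from."""
    # DataFrame.repartition(numPartitions, *cols) hash-partitions the rows on
    # the given column(s); column names may be passed as plain strings.
    clustered = df.repartition(num_buckets, bucket_col)
    # Register the repartitioned data under the view name used in the MERGE.
    clustered.createOrReplaceTempView(view_name)
    return clustered
```

It would be called as prepare_merge_source(df) in place of the plain
df.createOrReplaceTempView("source") call, before running the MERGE.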

Let me know if you need any further details.

The merge fails with an OutOfMemoryError (OOME).

Regards
Gaurav