Posted to commits@hudi.apache.org by "HuangFru (via GitHub)" <gi...@apache.org> on 2023/03/31 02:49:40 UTC

[GitHub] [hudi] HuangFru opened a new issue, #8332: [SUPPORT] Spark insert overwrite causes executors OOM.

HuangFru opened a new issue, #8332:
URL: https://github.com/apache/hudi/issues/8332

   **Describe the problem you faced**
   
   I'm running a simple write performance test for Hudi in Spark on YARN, but my executors keep dying from OOM, and the 'insert overwrite' SQL is very slow.
   I've created a table like this:
   ```
   create table lineitem_kp_mor (
      l_orderkey bigint,
      l_partkey bigint,
      l_suppkey bigint,
      l_linenumber int,
      l_quantity double,
      l_extendedprice double,
      l_discount double,
      l_tax double,
      l_returnflag string,
      l_linestatus string,
      l_shipdate date,
      l_receiptdate date,
      l_shipinstruct string,
      l_shipmode string,
      l_comment string,
      l_commitdate date
   ) USING hudi tblproperties (
   	type = 'mor',
   	primaryKey = 'l_orderkey,l_linenumber',
   	hoodie.memory.merge.max.size = '2004857600000',
   	hoodie.write.markers.type = 'DIRECT',
   	hoodie.sql.insert.mode = 'non-strict'
   ) PARTITIONED BY (l_commitdate);
   ```
   This is the TPC-H benchmark's `lineitem` table, to which I added a partition key 'l_commitdate'; the table then has 2466 partitions.
   I'm writing into the table through 'insert overwrite ... select ... from' in Spark SQL. The data size is 'sf100': about 600 million rows, 15 GB in storage.
   
   Here's the command I submit to YARN:
   ```
   bin/beeline -u 'jdbc:hive2://localhost:10010/;#spark.master=yarn;spark.yarn.queue=default;spark.executor.instances=11;spark.executor.cores=5;spark.executor.memory=8g;spark.driver.memory=4g'
   ```
   Resources: 11 executors, each with 8 GB memory and 5 cores; driver memory 4 GB.
   
   The insert overwrite SQL:
   ```
   insert overwrite table lineitem_kp_mor select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment,l_commitdate from tpch.sf100.lineitem;
   ```
   The SQL runs very slowly and eventually fails because executors die.
   Executors:
   ![image](https://user-images.githubusercontent.com/68625618/229009065-176aa657-e4ca-4dc4-a627-37ff2e7cbaeb.png)
   Stages:
   ![image](https://user-images.githubusercontent.com/68625618/229010340-c31e5728-2a26-4ef2-93c2-84e7da185f4b.png)
   Dead Executors' log:
   ![image](https://user-images.githubusercontent.com/68625618/229009250-0b2f2bf1-2b5f-42a4-b21c-c5f623c59fb0.png)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Prepare a TPC-H data set in Spark.
   2. Run Spark SQL on YARN, then create the table and run the insert overwrite as in my SQL above.
   
   **Expected behavior**
   
   I don't know if I've missed some important config or something. I'm new to Hudi. Could you please give me some help?
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.1.3
   
   * Hive version : 2.1.1
   
   * Hadoop version : 2.7.3
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   
   **Stacktrace**
   
   ```
   Driver stacktrace:
   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2490)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:902)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
   	at org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
   	at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
   	at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
   	at org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:104)
   	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:187)
   	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:156)
   	at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
   	at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:57)
   	... 36 more
   
   	at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
   	at org.apache.kyuubi.engine.spark.operation.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:112)
   	at org.apache.kyuubi.engine.spark.operation.SparkOperation$$anonfun$onError$1.applyOrElse(SparkOperation.scala:96)
   	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
   	at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:113)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:87)
   	at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:89)
   	at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$1.run(ExecuteStatement.scala:125)
   	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20230331101517327
   	at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
   	at org.apache.hudi.table.action.commit.SparkInsertOverwriteCommitActionExecutor.execute(SparkInsertOverwriteCommitActionExecutor.java:63)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insertOverwrite(HoodieSparkCopyOnWriteTable.java:159)
   	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.insertOverwrite(HoodieSparkCopyOnWriteTable.java:97)
   	at org.apache.hudi.client.SparkRDDWriteClient.insertOverwrite(SparkRDDWriteClient.java:207)
   	at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:215)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:306)
   	at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:94)
   	at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:47)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
   	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3700)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3698)
   	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
   	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
   	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
   	at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:94)
	... 9 more
   ```
   
   




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1499906715

   2. You can try setting `hoodie.populate.meta.fields` to false, so the Hudi meta fields are not populated into the base files.
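
   For reference, a minimal Spark SQL sketch of that suggestion against the table from this thread. This is a hedged sketch: `hoodie.populate.meta.fields` controls virtual keys and generally needs to be in effect from the table's first write.
   ```
   -- Sketch: disable population of Hudi meta fields into base files.
   set hoodie.populate.meta.fields=false;

   insert overwrite table lineitem_kp_mor
   select l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice,
          l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_receiptdate,
          l_shipinstruct, l_shipmode, l_comment, l_commitdate
   from tpch.sf100.lineitem;
   ```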




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1539775194

   > @HuangFru Did you get the chance to try the above?
   
   Not yet. It may be a long time until my next performance test (or I may forget about it). You can close this issue for now; when I run the next test, I will reopen it and give feedback.




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1492815828

   @HuangFru yes, much better




[GitHub] [hudi] danny0405 commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491262859

   Yes, it seems we still use the `UPSERT` code path for the `INSERT OVERWRITE TABLE` operation; it should be optimized to `INSERT` if possible.




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1499137667

   @HuangFru 
   1. If the primaryKey does not already exist in the table, the record is generally added to a new parquet file first; you can confirm this. 
   2. Can you confirm your parallelism? Or can you share the Spark UI stage tab for both Iceberg and Hudi?
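
   For context, Hudi's write-path parallelism is commonly tuned with session-level configs like the following (a hedged sketch; the value 200 is a placeholder, not a recommendation from this thread):
   ```
   -- Sketch: raise Hudi shuffle parallelism for the write paths.
   set hoodie.insert.shuffle.parallelism=200;
   set hoodie.upsert.shuffle.parallelism=200;
   set hoodie.bulkinsert.shuffle.parallelism=200;
   ```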




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1492813225

   Insert overwrite still deals with small files; you can use `bulk insert` to avoid that and speed things up, but it only supports `insert into`, which is not an idempotent operation, so you need to truncate the partitions you want to insert into. Fortunately, the community has a related patch that supports bulk overwrite, #8076; you can try it.
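
   A hedged Spark SQL sketch of that workaround, using the configs that appear later in this thread (truncating first, because `insert into` with bulk insert appends rather than overwrites):
   ```
   set hoodie.sql.bulk.insert.enable=true;
   set hoodie.sql.insert.mode=non-strict;

   -- Clear the table (or the target partitions) before re-loading,
   -- since bulk insert via `insert into` is not idempotent:
   truncate table lineitem_kp_mor;

   insert into lineitem_kp_mor
   select l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice,
          l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_receiptdate,
          l_shipinstruct, l_shipmode, l_comment, l_commitdate
   from tpch.sf100.lineitem;
   ```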




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1499891797

   @KnightChess 
   1. I'm not sure what 'primaryKey not exist in the table' means; the data is intact. And what does 'add parquet first' mean? 
   2.
   Hudi:
   SQL (just change the table name):
   ```
   create table lineitem_kp_mor (
      l_orderkey bigint,
      l_partkey bigint,
      l_suppkey bigint,
      l_linenumber int,
      l_quantity double,
      l_extendedprice double,
      l_discount double,
      l_tax double,
      l_returnflag string,
      l_linestatus string,
      l_shipdate date,
      l_receiptdate date,
      l_shipinstruct string,
      l_shipmode string,
      l_comment string,
      l_commitdate date
   ) USING hudi tblproperties (
   	type = 'mor',
   	primaryKey = 'l_orderkey,l_linenumber',
   	hoodie.memory.merge.max.size = '2004857600000',
   	hoodie.write.markers.type = 'DIRECT',
   	hoodie.sql.insert.mode = 'non-strict'
   ) PARTITIONED BY (l_commitdate);
   
   set hoodie.sql.bulk.insert.enable=true;
   set hoodie.sql.insert.mode=non-strict;
   set hoodie.bulkinsert.shuffle.parallelism=200;
   
   insert into lineitem_kp_mor select l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment,l_commitdate from spark_catalog.test_origin.lineitem_sf100 order by l_commitdate;
   ```
   ![image](https://user-images.githubusercontent.com/68625618/230532528-6397f571-9fec-4d93-a333-3dbe096298bb.png)
   ![image](https://user-images.githubusercontent.com/68625618/230532844-070dc110-71c6-452d-82fc-27922362e149.png)
   ![image](https://user-images.githubusercontent.com/68625618/230532879-aa2d7f11-727b-4ef4-a674-85dd64eb631a.png)
   ![image](https://user-images.githubusercontent.com/68625618/230533072-e761e244-3e4e-477a-9777-6c277dd81040.png)
   
   Iceberg:
   ![image](https://user-images.githubusercontent.com/68625618/230533780-35eea907-09d7-444d-9670-a97cf3fbf624.png)
   ![image](https://user-images.githubusercontent.com/68625618/230533801-c87ec10f-69f5-4062-a09f-61b23c55e20d.png)
   ![image](https://user-images.githubusercontent.com/68625618/230533837-68fd69ca-c1f4-41e4-a865-daa22f6586c2.png)
   ![image](https://user-images.githubusercontent.com/68625618/230533936-737d7069-1db8-483f-80b9-54724ac83b8d.png)
   
   




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1502722572

   @KnightChess 
   Sorry, I've been a little busy recently and haven't pushed this forward. I'll try it when I'm free.




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1499906206

   1. You can try the following steps and observe the data files after each write (see the SQL sketch below).
   (Assume that the primary key is `ID`.)
   first: insert one record: id = 1
   second: insert one record: id = 2
   third: insert one record: id = 1
   fourth: insert two records: id = 1 and 3
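
   A minimal sketch of that experiment in Spark SQL. The table name `t` and the `v` column are hypothetical, purely for illustration:
   ```
   create table t (id bigint, v string) using hudi
   tblproperties (primaryKey = 'id');

   insert into t values (1, 'a');            -- first: one record, id = 1
   insert into t values (2, 'b');            -- second: one record, id = 2
   insert into t values (1, 'c');            -- third: the same key again, id = 1
   insert into t values (1, 'd'), (3, 'e');  -- fourth: two records, id = 1 and 3
   -- After each insert, inspect the data files under the table's base path
   -- to see whether new parquet files were added or existing ones rewritten.
   ```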




Re: [I] [SUPPORT] Spark insert overwrite in partition table causes executors OOM. [hudi]

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1744230476

   @nandubatchu What executor memory were you using? 




[GitHub] [hudi] KnightChess commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1502655796

   @HuangFru Does the above configuration solve your problem? I found an article describing the benchmark load that you can refer to: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-transparent-tpc-ds-lakehouse-performance-benchmarks




[GitHub] [hudi] ad1happy2go commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491262656

   @HuangFru Did you try increasing executor memory? You can still keep executor cores at 5.




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491273624

   > Yes, it seems we still use the `UPSERT` code path for the `INSERT OVERWRITE TABLE` operation; it should be optimized to `INSERT` if possible.
   
   Any suggestions on how to proceed? Should I use 'insert into' to write the data? But usually 'insert into' performs worse than 'insert overwrite'.




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491436903

   > @HuangFru Did you try increasing executor memory? You can still keep executor cores at 5.
   
   Even if I increase executor memory to 16G and keep all other resources the same, the same phenomenon occurs.




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491658345

   > I could be wrong, but it is worth trying to run the insert statement on a non-Hudi table. I suspect you might get the same error because of too much data.
   
   I've tried it with Iceberg and it works fine; Iceberg takes about 230s with the same resources for the same 'insert overwrite'.
   I also tried it with Hive, which took more time but also works fine.




[GitHub] [hudi] xmubeta commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "xmubeta (via GitHub)" <gi...@apache.org>.
xmubeta commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491654532

   I could be wrong, but it is worth trying to run the insert statement on a non-Hudi table. I suspect you might get the same error because of too much data.




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1493833536

   > Insert overwrite still deals with small files; you can use `bulk insert` to avoid that and speed things up, but it only supports `insert into`, which is not an idempotent operation, so you need to truncate the partitions you want to insert into. Fortunately, the community has a related patch that supports bulk overwrite, #8076; you can try it.
   
   'bulk insert' works fine. According to my understanding, a MOR table should have better write performance than a COW table, but I found that the COW table took 450s to finish this while the same MOR table took 500s. Why?
   In addition, the 450s is much worse than Iceberg's 230s. Are there some configurations I didn't set?




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1491270176

   > @HuangFru Did you try increasing executor memory? You can still keep executor cores at 5.
   
   I'm doing a performance test, so I must control variables. I've tried decreasing the number of executors and increasing the executor memory, but it still OOMs.




[GitHub] [hudi] ad1happy2go commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1539457910

   @HuangFru Did you get the chance to try the above? 




[GitHub] [hudi] codope closed issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.
URL: https://github.com/apache/hudi/issues/8332




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1492815564

   > Insert overwrite still deals with small files; you can use `bulk insert` to avoid that and speed things up, but it only supports `insert into`, which is not an idempotent operation, so you need to truncate the partitions you want to insert into. Fortunately, the community has a related patch that supports bulk overwrite, #8076; you can try it.
   
   If the table is empty, will 'insert into' using 'bulk insert' have better performance than 'insert overwrite'?




[GitHub] [hudi] HuangFru commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "HuangFru (via GitHub)" <gi...@apache.org>.
HuangFru commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1492816104

   > @HuangFru yes, much better
   
   Thanks for your answer. I'll try.




[GitHub] [hudi] nandubatchu commented on issue #8332: [SUPPORT] Spark insert overwrite in partition table causes executors OOM.

Posted by "nandubatchu (via GitHub)" <gi...@apache.org>.
nandubatchu commented on issue #8332:
URL: https://github.com/apache/hudi/issues/8332#issuecomment-1740486050

   Hey @ad1happy2go - I too observed similar OOM behaviour when using `insert_overwrite_table` on a reasonably large table (200 GB); it works fine with very small tables.

