Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/15 13:03:02 UTC

[GitHub] [hudi] srinikandi opened a new issue #5047: [SUPPORT]

srinikandi opened a new issue #5047:
URL: https://github.com/apache/hudi/issues/5047


   Hi,
   I have been using the Apache Hudi Connector for AWS Glue (Hudi 0.9.0) and am facing a small-file creation problem while inserting data into a Hudi table. The input is about 17 GB across 313 parquet part files, each averaging around 70 MB. When I insert the data into the Hudi table with the overwrite option, the write produces some 7,000-plus parquet part files of about 6.5 MB each. I did set the small-file-size and max-file-size parameters while writing.
   Here is the config that I used; the intention was to create file sizes between 60 and 80 MB.
    common_config = {
           "hoodie.datasource.write.hive_style_partitioning": "true",
           "className": "org.apache.hudi",
           "hoodie.datasource.hive_sync.use_jdbc": "false",
           "hoodie.datasource.write.recordkey.field": "C1,C2,C3,C4",
           "hoodie.datasource.write.precombine.field": "",
           "hoodie.table.name": "TEST_TABLE",
           "hoodie.datasource.hive_sync.database": "TEST_DATABASE_RAW",
           "hoodie.datasource.hive_sync.table": "TEST_TABLE",
           "hoodie.datasource.hive_sync.enable": "true",
           "hoodie.datasource.write.partitionpath.field": "",
           "hoodie.datasource.hive_sync.support_timestamp": "true",
           "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
           "hoodie.parquet.small.file.limit": "62914560",
           "hoodie.parquet.max.file.limit": "83886080",
           "hoodie.parquet.block.size": "62914560",
           "hoodie.insert.shuffle.parallelism": "5",
           "hoodie.datasource.write.operation": "insert"
           }
   full_ref_files is a list of all the parquet part files from the input folder.
    full_load_df = spark.read.parquet(*full_ref_files)
   full_load_df.write.format("hudi").options(**common_config).mode("overwrite").save(raw_table_path)
   
   Below are the Spark logs, where the last line shows that the write creates some 1,000-plus partitions while writing to the Hudi table.
   Any insights are deeply appreciated.
   
   2022-03-15T06:42:23.502-05:00 | [Stage 1 (javaToPython at NativeMethodAccessorImpl.java:0):> (0 + 113) / 116]
   2022-03-15T06:42:28.379-05:00 | [Stage 3 (collect at /tmp/stage-to-raw-etl-glue-job-poc-1.py:105): (83 + 33) / 116]
   2022-03-15T06:43:06.306-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175): (0 + 116) / 117]
   2022-03-15T06:47:21.640-05:00 | [Stage 6 (countByKey at BaseSparkCommitActionExecutor.java:175): (117 + 0) / 117]
   2022-03-15T06:47:27.011-05:00 | [Stage 9 (mapToPair at BaseSparkCommitActionExecutor.java:209):> (0 + 117) / 117]
   2022-03-15T06:47:58.272-05:00 | [Stage 12 (isEmpty at HoodieSparkSqlWriter.scala:609):==> (1 + 3) / 4]
   2022-03-15T06:48:04.654-05:00 | [Stage 14 (isEmpty at HoodieSparkSqlWriter.scala:609):> (1 + 19) / 20]
   2022-03-15T06:48:16.497-05:00 | [Stage 16 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 100) / 100]
   2022-03-15T06:48:37.952-05:00 | [Stage 18 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 117) / 500]
   2022-03-15T06:49:46.207-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):> (0 + 117) / 473]
   2022-03-15T06:50:35.294-05:00 | [Stage 20 (isEmpty at HoodieSparkSqlWriter.scala:609):=========>(473 + 0) / 473]
   2022-03-15T06:50:36.433-05:00 | [Stage 22 (collect at SparkRDDWriteClient.java:123):=====> (773 + 115) / 1098]
   (repeated intermediate progress-bar updates trimmed; only stage boundaries shown)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

nsivabalan commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1079815928


   We recently fixed the small-file config behavior:
   https://github.com/apache/hudi/pull/5129
   
   Also, try setting "hoodie.parquet.block.size" in addition to "hoodie.parquet.max.file.size", which can impact your file sizes. 
   
   Please feel free to re-open if the above suggestions do not work and you can reproduce with the latest master.
   Thanks! 
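   For reference, a minimal sketch of the file-sizing keys with their correct names (values illustrative, matching the 60-80 MB target from the original post). Note the key is `hoodie.parquet.max.file.size`; the config in the original post used `hoodie.parquet.max.file.limit`, which does not appear to be a recognized Hudi config key.

   ```python
   # Illustrative sketch only: Hudi file-sizing keys with their correct names.
   # The original post used "hoodie.parquet.max.file.limit", which Hudi
   # would silently ignore, leaving the default max file size in effect.
   sizing_config = {
       "hoodie.parquet.small.file.limit": str(60 * 1024 * 1024),  # 60 MB
       "hoodie.parquet.max.file.size": str(80 * 1024 * 1024),     # 80 MB
       "hoodie.parquet.block.size": str(60 * 1024 * 1024),        # 60 MB row group
   }
   print(sizing_config["hoodie.parquet.max.file.size"])  # 83886080
   ```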
   
   
   
   





[GitHub] [hudi] srinikandi commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

srinikandi commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1080779635


   Is there any recommendation on how big these individual parquet files should be for optimal performance?





[GitHub] [hudi] srinikandi commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

srinikandi commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1080778514


   > @srinikandi : did you get a chance to try out the above config.
   
   Yes, I did have success using the parameters hoodie.copyonwrite.record.size.estimate and hoodie.parquet.block.size. Thanks for the suggestion.
   
   





[GitHub] [hudi] nsivabalan closed issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

nsivabalan closed issue #5047:
URL: https://github.com/apache/hudi/issues/5047


   





[GitHub] [hudi] tjtoll edited a comment on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

tjtoll edited a comment on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1069788126


   I am experiencing the exact same problem, also with AWS Glue. I tried the settings below and they increased the files to a more appropriate size, but the insert takes 10x longer. I am also interested in some guidance on this issue.
   
   'hoodie.copyonwrite.insert.auto.split': 'false',
   'hoodie.copyonwrite.insert.split.size': 100000000,
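   The settings above trade write parallelism for file size, which would explain the 10x slowdown. A hedged back-of-envelope sketch (illustrative numbers only; the ~200M-row input total is an assumption, not from the thread):

   ```python
   # Rough sketch (not Hudi source): with auto-split disabled and a very large
   # hoodie.copyonwrite.insert.split.size, each write task is allowed to absorb
   # far more records, so far fewer tasks (and files) are produced.

   def write_tasks(total_records: int, split_size: int) -> int:
       """Approximate number of insert write tasks (ceiling division)."""
       return -(-total_records // split_size)

   TOTAL_RECORDS = 200_000_000  # assumption: ~200M rows in the 17 GB input

   # Default split size (~500k records) spreads the insert over many tasks.
   print(write_tasks(TOTAL_RECORDS, 500_000))       # 400 tasks
   # The 100,000,000 setting above collapses it to a handful of tasks,
   # giving big files but little parallelism.
   print(write_tasks(TOTAL_RECORDS, 100_000_000))   # 2 tasks
   ```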








[GitHub] [hudi] nsivabalan commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

nsivabalan commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1072927114


   For the first-time load to a new Hudi table, "hoodie.copyonwrite.record.size.estimate" might be required; it determines how small files are handled. For subsequent commits, Hudi can determine the record size from previous commits. 
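   The effect of this estimate can be sketched with rough arithmetic (an illustrative sketch, not Hudi source; the ~100-byte actual record size is an assumption inferred from the file sizes reported above):

   ```python
   # Hedged sketch: why a wrong record-size estimate yields small files on the
   # very first commit. With no prior commit to measure real record sizes from,
   # Hudi sizes file groups using hoodie.copyonwrite.record.size.estimate
   # (default 1024 bytes).

   def records_per_file(max_file_size_bytes: int, record_size_estimate: int) -> int:
       """Approximate records Hudi packs into one base file."""
       return max_file_size_bytes // record_size_estimate

   TARGET_FILE_SIZE = 80 * 1024 * 1024  # ~80 MB target from the poster's config

   # With the default 1024-byte estimate, each file is budgeted ~80k records.
   n_default = records_per_file(TARGET_FILE_SIZE, 1024)
   print(n_default)  # 81920

   # If the true on-disk record size is closer to ~100 bytes (assumption),
   # those ~80k records land at roughly 8 MB per file -- close to the
   # 6.5 MB files observed in this issue.
   ACTUAL_RECORD_SIZE = 100
   print(round(n_default * ACTUAL_RECORD_SIZE / (1024 * 1024)))  # 8
   ```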
   





[GitHub] [hudi] nsivabalan commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

nsivabalan commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1075182524


   @srinikandi : did you get a chance to try out the above config? 
   





[GitHub] [hudi] srinikandi commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

srinikandi commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1069329016


   After some research, I found that setting the parameter below to a value lower than the default of 1024 fixes the small-file problem during the first load to an extent. Is this the recommended practice, or should the small-file-size and max-file-size limit parameters work for the first-time load as well?
   
   "hoodie.copyonwrite.record.size.estimate": "100"





[GitHub] [hudi] nsivabalan commented on issue #5047: [SUPPORT] Small file creation while writing to a Hudi Table

nsivabalan commented on issue #5047:
URL: https://github.com/apache/hudi/issues/5047#issuecomment-1081975672


   It depends on your cluster and workload characterization; it is tough to give one number. But for most cases, 120 MB works well, unless your scale is huge. 

