Posted to dev@hudi.apache.org by selvaraj periyasamy <se...@gmail.com> on 2020/03/15 19:16:35 UTC

Small Files

Team,

I am using Hudi 0.5.0. While writing a COW table with the code below, many
small files of ~15 MB are getting created, whereas the total partition size is
300 MB+.

  val output = transDetailsDF.write.format("org.apache.hudi").
          option("hoodie.insert.shuffle.parallelism", "2").
          option("hoodie.upsert.shuffle.parallelism", "2").
          option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
          option(OPERATION_OPT_KEY, "upsert").
          option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
          option(RECORDKEY_FIELD_OPT_KEY,"record_key").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
          option(TABLE_NAME, tableName).
          option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
          option("hoodie.memory.merge.max.size", "2004857600000").
          option("hoodie.bloom.index.prune.by.ranges","false").
          option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
          option("hoodie.cleaner.commits.retained", 2).
          option("hoodie.keep.min.commits", 3).
          option("hoodie.keep.max.commits", 5).
          option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
          option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
          mode(Append).
          save(basePath);
As per the instructions provided in
https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
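
In other words, I am assuming the FAQ's names map onto the writer options like
this (my reading, with the byte math spelled out):

  // limitFileSize -> "hoodie.parquet.max.file.size" (target base file size)
  val maxFileSize    = String.valueOf(128 * 1024 * 1024)  // 134217728 bytes = 128 MB
  // compactionSmallFileSize -> "hoodie.parquet.small.file.limit" (files below
  // this are treated as "small" and padded with new inserts on later commits)
  val smallFileLimit = String.valueOf(100 * 1024 * 1024)  // 104857600 bytes = 100 MB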

The Hadoop block size is 256 MB; I am expecting 128 MB files to be created.

Am I missing any config here?

Thanks,
Selva

Re: Small Files

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Selva,

Hudi has a CLI which will summarize each commit nicely. Can you also
provide the output from that? It will tell you how many files were
created/updated, etc.
http://hudi.apache.org/docs/deployment.html#inspecting-commits
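
Roughly, something like this (see the docs page above for the exact syntax):

  $ hudi-cli
  hudi-> connect --path /projects/transaction_details_hourly_hudi
  hudi-> commits show
  hudi-> commit showfiles --commit 20200316074213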

The 2765125 records in the initial batch are getting split into 2.7M/500K ≈ 6
buckets during writing (to get parallel write performance), as per the config
I pointed out before. However, that is not as high as 20, the number of files
you are getting. Can you share the driver logs around the statement below for
the initial commit (HoodieCopyOnWrite#UpsertPartitioner is what we want)? We
can open a GitHub issue if it makes it easier to share logs/code, etc.

LOG.info("Total Buckets :" + totalBuckets + ", buckets info => " + bucketInfoMap + ", \n"
    + "Partition to insert buckets => " + partitionPathToInsertBuckets + ", \n"
    + "UpdateLocations mapped to buckets =>" + updateLocationToBucket);


Aside from that, sampling a file id across the later commits, it does seem
like it is getting rewritten as expected. So if we understand why you have 20
files to begin with, we can go from there.
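
To show what I mean by sampling: base files are named
<fileId>_<writeToken>_<commitTime>.parquet, so you can follow one file group
across commits with something like:

  hdfs dfs -ls /projects/transaction_details_hourly_hudi/20191201/11 | grep 29a94502

In your listing that file id shows up once per commit (20200316074213,
20200316074336, 20200316074511), i.e. the same file group being rewritten each
time rather than new files appearing.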


On Mon, Mar 16, 2020 at 12:48 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> And then I ran updates for 2000 records 4 times, and below are the
> files.
>
>   transDetailsDF1.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "2").
>           option("hoodie.upsert.shuffle.parallelism", "2").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "upsert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained", 1).
>           option("hoodie.keep.min.commits", 2).
>           option("hoodie.keep.max.commits", 3).
>           option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
>           option("hoodie.copyonwrite.insert.split.size","2650000").
>           mode(Append).
>           save(basePath);
>
> Found 67 items
> -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
>
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet
>
> Thanks,
> Selva
>
> On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Hi Vinoth,
> >
> > I tried multiple runs. The total records expected in the
> > partition is 2765125. Below is the spark-shell command.
> >
> > spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> > 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
> >  yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> > --executor-memory 40g  --num-executors 5 --executor-cores 5 --conf
> > 'spark.executor.memoryOverhead=2048' --conf
> > 'spark.dynamicAllocation.enabled=false' --conf
> > 'spark.sql.hive.convertMetastoreParquet=false' --conf
> > 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m'
> > --conf 'spark.shuffle.service.enabled=true'
> >
> > Dynamic allocation set to false
> >
> > Attempt 1 -> Tried with mode Overwrite and OPERATION_OPT_KEY set to
> > insert. Below is the code.
> >
> >           transDetailsDF1.write.format("org.apache.hudi").
> >           option("hoodie.insert.shuffle.parallelism", "5").
> >           option("hoodie.upsert.shuffle.parallelism", "5").
> >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >           option(OPERATION_OPT_KEY, "insert").
> >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >           option(TABLE_NAME, tableName).
> >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >           option("hoodie.memory.merge.max.size", "2004857600000").
> >           option("hoodie.bloom.index.prune.by.ranges","false").
> >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >           option("hoodie.cleaner.commits.retained", 1).
> >           option("hoodie.keep.min.commits", 2).
> >           option("hoodie.keep.max.commits", 3).
> >           option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
> >           option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
> >           option("hoodie.copyonwrite.insert.split.size","2650000").
> >           mode(Overwrite).
> >           save(basePath);
> >
> >
> >
> >
> > Below are the files in HDFS .
> >
> > Found 23 items
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> >
> >
> >
> > Attempt 2 -> Updated 10 records with Append mode and the upsert operation.
> >
> >
> >          transDetailsDF1.write.format("org.apache.hudi").
> >           option("hoodie.insert.shuffle.parallelism", "5").
> >           option("hoodie.upsert.shuffle.parallelism", "5").
> >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >           option(OPERATION_OPT_KEY, "upsert").
> >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >           option(TABLE_NAME, tableName).
> >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >           option("hoodie.memory.merge.max.size", "2004857600000").
> >           option("hoodie.bloom.index.prune.by.ranges","false").
> >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >           option("hoodie.cleaner.commits.retained", 1).
> >           option("hoodie.keep.min.commits", 2).
> >           option("hoodie.keep.max.commits", 3).
> >           option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
> >           option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
> >           option("hoodie.copyonwrite.insert.split.size","2650000").
> >           mode(Append).
> >           save(basePath);
> >
> >
> >
> > Found 31 items
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
> >
> > -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
> >
> >
> >
> >
> >
> >
> >
> > In both cases, file sizes are around 15 MB.
> >
> >
> > Thanks,
> >
> > Selva
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org> wrote:
> >
> >> Hi Selva,
> >>
> >> Was this the first insert? Hudi handles small files by converting some
> >> inserts into updates to existing files. In this case, I see just one
> >> commit time, so there is nothing Hudi could optimize for. If you continue
> >> making updates/inserts over time, you should see these four files being
> >> expanded up to the configured limits, instead of new files being created.
> >>
> >> Let me know if that helps. Also, another config to pay attention to for
> >> the first batch of inserts is
> >> http://hudi.apache.org/docs/configurations.html#insertSplitSize
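> >>
> >> For the very first commit there are no existing files to pad, so
> >> insertSplitSize governs how that initial insert gets packed into files. A
> >> sketch (the value here is illustrative only, not a recommendation):
> >>
> >>   // records packed per file on the first insert; pick it so that
> >>   // (records per file) x (avg record size) lands near the target file size
> >>   option("hoodie.copyonwrite.insert.split.size", "1000000").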
> >>
> >> Thanks
> >> Vinoth
> >>
> >> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
> >> selvaraj.periyasamy1983@gmail.com> wrote:
> >>
> >> > Below are the few files.
> >> >
> >> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> >> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> >> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> >> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
> >> >
> >> >
> >> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> >> > selvaraj.periyasamy1983@gmail.com> wrote:
> >> >
> >> > > Team,
> >> > >
> >> > > I am using Hudi 0.5.0. While writing a COW table with the code below,
> >> > > many small files of ~15 MB are getting created, whereas the total
> >> > > partition size is 300 MB+.
> >> > >
> >> > >   val output = transDetailsDF.write.format("org.apache.hudi").
> >> > >           option("hoodie.insert.shuffle.parallelism", "2").
> >> > >           option("hoodie.upsert.shuffle.parallelism", "2").
> >> > >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >> > >           option(OPERATION_OPT_KEY, "upsert").
> >> > >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >> > >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >> > >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >> > >           option(TABLE_NAME, tableName).
> >> > >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >> > >           option("hoodie.memory.merge.max.size", "2004857600000").
> >> > >           option("hoodie.bloom.index.prune.by.ranges","false").
> >> > >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >> > >           option("hoodie.cleaner.commits.retained", 2).
> >> > >           option("hoodie.keep.min.commits", 3).
> >> > >           option("hoodie.keep.max.commits", 5).
> >> > >           option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
> >> > >           option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
> >> > >           mode(Append).
> >> > >           save(basePath);
> >> > > As per the instructions provided in
> >> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> >> > > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
> >> > >
> >> > > The Hadoop block size is 256 MB; I am expecting 128 MB files to be
> >> > > created.
> >> > >
> >> > > Am I missing any config here?
> >> > >
> >> > > Thanks,
> >> > > Selva
> >> > >
> >> >
> >>
> >
>

Re: Small Files

Posted by selvaraj periyasamy <se...@gmail.com>.
And then I ran updates for 2000 records 4 times, and below are the
files.

  transDetailsDF1.write.format("org.apache.hudi").
          option("hoodie.insert.shuffle.parallelism", "2").
          option("hoodie.upsert.shuffle.parallelism", "2").
          option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
          option(OPERATION_OPT_KEY, "upsert").
          option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
          option(RECORDKEY_FIELD_OPT_KEY,"record_key").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
          option(TABLE_NAME, tableName).
          option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
          option("hoodie.memory.merge.max.size", "2004857600000").
          option("hoodie.bloom.index.prune.by.ranges","false").
          option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
          option("hoodie.cleaner.commits.retained", 1).
          option("hoodie.keep.min.commits", 2).
          option("hoodie.keep.max.commits", 3).
          option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
          option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
          option("hoodie.copyonwrite.insert.split.size", "2650000").
          mode(Append).
          save(basePath);

Found 67 items
-rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet

Thanks,
Selva

On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> Hi Vinoth,
>
> I tried multiple runs. The total records expected in the
> partition is 2765125. Below is the spark-shell command.
>
> spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
>  yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> --executor-memory 40g  --num-executors 5 --executor-cores 5 --conf
> 'spark.executor.memoryOverhead=2048' --conf
> 'spark.dynamicAllocation.enabled=false' --conf
> 'spark.sql.hive.convertMetastoreParquet=false' --conf
> 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m'
> --conf 'spark.shuffle.service.enabled=true'
>
> Dynamic allocation set to false
>
> Attempt 1 -> Tried with mode Overwrite and OPERATION_OPT_KEY set to insert.
> Below is the code.
>
>           transDetailsDF1.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "5").
>           option("hoodie.upsert.shuffle.parallelism", "5").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "insert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained", 1).
>           option("hoodie.keep.min.commits", 2).
>           option("hoodie.keep.max.commits", 3).
>           option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
>           option("hoodie.copyonwrite.insert.split.size", "2650000").
>           mode(Overwrite).
>           save(basePath);
>
>
>
>
> Below are the files in HDFS .
>
> Found 23 items
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
>
>
>
> Attempt 2 -> Updated 10 records with Append mode and the upsert operation
>
>
>           transDetailsDF1.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "5").
>           option("hoodie.upsert.shuffle.parallelism", "5").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "upsert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained",1).
>           option("hoodie.keep.min.commits",2).
>           option("hoodie.keep.max.commits",3).
>           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>           option("hoodie.copyonwrite.insert.split.size","2650000").
>           mode(Append).
>           save(basePath);
>
>
>
> Found 31 items
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
>
> -rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
>
>
>
>
>
>
>
> In both cases, the file sizes are around 15 MB.
>
>
> Thanks,
>
> Selva
>
>
>
>
>
>
>
>
> On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org> wrote:
>
>> Hi Selva,
>>
>> Was this the first insert? Hudi handles small files by converting some
>> inserts as updates to existing files. In this case, I see just one commit
>> time, so there is nothing Hudi could optimize for.
>> If you continue making updates/inserts over time, you should see these
>> four files being expanded up to the configured limits, instead of new
>> files being created.
>>
>> Let me know if that helps. Also, another config to pay attention to for
>> the first batch of inserts is
>> http://hudi.apache.org/docs/configurations.html#insertSplitSize
>>
>> Thanks
>> Vinoth
>>
>> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
>> selvaraj.periyasamy1983@gmail.com> wrote:
>>
>> > Below are a few of the files.
>> >
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
>> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>> >
>> >
>> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
>> > selvaraj.periyasamy1983@gmail.com> wrote:
>> >
>> > > Team,
>> > >
>> > > I am using Hudi 0.5.0. While writing a COW table with the code below,
>> > > many small files of about 15 MB are getting created, whereas the total
>> > > partition size is 300 MB+.
>> > >
>> > >   val output = transDetailsDF.write.format("org.apache.hudi").
>> > >           option("hoodie.insert.shuffle.parallelism", "2").
>> > >           option("hoodie.upsert.shuffle.parallelism", "2").
>> > >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>> > >           option(OPERATION_OPT_KEY, "upsert").
>> > >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>> > >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>> > >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>> > >           option(TABLE_NAME, tableName).
>> > >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>> > >           option("hoodie.memory.merge.max.size", "2004857600000").
>> > >           option("hoodie.bloom.index.prune.by.ranges","false").
>> > >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>> > >           option("hoodie.cleaner.commits.retained", 2).
>> > >           option("hoodie.keep.min.commits",3).
>> > >           option("hoodie.keep.max.commits",5).
>> > >           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>> > >           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>> > >           mode(Append).
>> > >           save(basePath);
>> > > As per the instructions provided in
>> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
>> > > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
>> > >
>> > > Hadoop block size is 256 MB; I am expecting 128 MB files to be
>> > > created.
>> > >
>> > > Am I missing any config here?
>> > >
>> > > Thanks,
>> > > Selva
>> > >
>> >
>>
>

Re: Small Files

Posted by selvaraj periyasamy <se...@gmail.com>.
Hi Vinoth,

I tried multiple runs. The total number of records expected in the
partition is 2765125. Below is the spark-shell command.

spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
yarn --deploy-mode client --queue cybslarge --driver-memory 4g
--executor-memory 40g --num-executors 5 --executor-cores 5 --conf
'spark.executor.memoryOverhead=2048' --conf
'spark.dynamicAllocation.enabled=false' --conf
'spark.sql.hive.convertMetastoreParquet=false' --conf
'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m'
--conf 'spark.shuffle.service.enabled=true'

Dynamic allocation is set to false.

Attempt 1 -> Ran with save mode Overwrite and OPERATION_OPT_KEY set to
insert. Below is the code.

          transDetailsDF1.write.format("org.apache.hudi").
          option("hoodie.insert.shuffle.parallelism", "5").
          option("hoodie.upsert.shuffle.parallelism", "5").
          option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
          option(OPERATION_OPT_KEY, "insert").
          option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
          option(RECORDKEY_FIELD_OPT_KEY,"record_key").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
          option(TABLE_NAME, tableName).
          option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
          option("hoodie.memory.merge.max.size", "2004857600000").
          option("hoodie.bloom.index.prune.by.ranges","false").
          option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
          option("hoodie.cleaner.commits.retained",1).
          option("hoodie.keep.min.commits",2).
          option("hoodie.keep.max.commits",3).
          option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
          option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
          option("hoodie.copyonwrite.insert.split.size","2650000").
          mode(Overwrite).
          save(basePath);




Below are the files in HDFS:

Found 23 items

-rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet




Attempt 2 -> Updated 10 records with Append mode and the upsert operation


          transDetailsDF1.write.format("org.apache.hudi").
          option("hoodie.insert.shuffle.parallelism", "5").
          option("hoodie.upsert.shuffle.parallelism", "5").
          option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
          option(OPERATION_OPT_KEY, "upsert").
          option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
          option(RECORDKEY_FIELD_OPT_KEY,"record_key").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
          option(TABLE_NAME, tableName).
          option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
          option("hoodie.memory.merge.max.size", "2004857600000").
          option("hoodie.bloom.index.prune.by.ranges","false").
          option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
          option("hoodie.cleaner.commits.retained",1).
          option("hoodie.keep.min.commits",2).
          option("hoodie.keep.max.commits",3).
          option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
          option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
          option("hoodie.copyonwrite.insert.split.size","2650000").
          mode(Append).
          save(basePath);



Found 31 items

-rw-r--r--   3 svchdc110p Hadoop_cdp         93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet

-rw-r--r--   3 svchdc110p Hadoop_cdp     14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet







In both cases, the file sizes are around 15 MB.
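
Some rough arithmetic on the first commit, in case it is relevant. This is
only a sketch; it assumes two 0.5.0 defaults that I have not verified:
hoodie.copyonwrite.record.size.estimate = 1024 bytes, and auto-tuned insert
splits (hoodie.copyonwrite.insert.auto.split).

  // Rough arithmetic only; the assumptions above are unverified.
  object SmallFileMath extends App {
    val totalRecords   = 2765125L                      // records in this partition
    val maxFileBytes   = 128L * 1024 * 1024            // hoodie.parquet.max.file.size
    val estBytesPerRec = 1024L                         // assumed record size estimate
    // With auto-split, records packed per new file = max file size / estimate
    val recsPerNewFile = maxFileBytes / estBytesPerRec // 131072
    val expectedFiles  = math.ceil(totalRecords.toDouble / recsPerNewFile).toLong // 22
    // 22 matches the 22 parquet files listed above. At ~15 MB per file, the
    // actual compressed record size comes out near:
    val actualBytesPerRec = (15L * 1024 * 1024) / (totalRecords / expectedFiles) // ~125
    println(s"files=$expectedFiles, actual record size ~ $actualBytesPerRec bytes")
  }

If those assumptions hold, the 1024-byte estimate would explain both the 22
files and why each one stops near 15 MB instead of growing toward 128 MB.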


Thanks,

Selva








On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Selva,
>
> Was this the first insert? Hudi handles small files by converting some
> inserts as updates to existing files. In this case, I see just one commit
> time, so there is nothing Hudi could optimize for.
> If you continue making updates/inserts over time, you should see these four
> files being expanded up to the configured limits, instead of new files being
> created.
>
> Let me know if that helps. Also, another config to pay attention to for
> the first batch of inserts is
> http://hudi.apache.org/docs/configurations.html#insertSplitSize
>
> Thanks
> Vinoth
>
> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Below are a few of the files.
> >
> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> > -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> > -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
> >
> >
> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> > selvaraj.periyasamy1983@gmail.com> wrote:
> >
> > > Team,
> > >
> > > I am using Hudi 0.5.0. While writing a COW table with the code below,
> > > many small files of about 15 MB are getting created, whereas the total
> > > partition size is 300 MB+.
> > >
> > >   val output = transDetailsDF.write.format("org.apache.hudi").
> > >           option("hoodie.insert.shuffle.parallelism", "2").
> > >           option("hoodie.upsert.shuffle.parallelism", "2").
> > >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> > >           option(OPERATION_OPT_KEY, "upsert").
> > >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> > >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> > >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> > >           option(TABLE_NAME, tableName).
> > >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> > >           option("hoodie.memory.merge.max.size", "2004857600000").
> > >           option("hoodie.bloom.index.prune.by.ranges","false").
> > >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> > >           option("hoodie.cleaner.commits.retained", 2).
> > >           option("hoodie.keep.min.commits",3).
> > >           option("hoodie.keep.max.commits",5).
> > >           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> > >           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> > >           mode(Append).
> > >           save(basePath);
> > > As per the instructions provided in
> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> > > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
> > >
> > > Hadoop block size is 256 MB; I am expecting 128 MB files to be
> > > created.
> > >
> > > Am I missing any config here?
> > >
> > > Thanks,
> > > Selva
> > >
> >
>

Re: Small Files

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Selva,

Was this the first insert? Hudi handles small files by converting some
inserts as updates to existing files. In this case, I see just one commit
time, so there is nothing Hudi could optimize for.
If you continue making updates/inserts over time, you should see these four
files being expanded up to the configured limits, instead of new files being
created.

Let me know if that helps. Also, another config to pay attention to for the
first batch of inserts is
http://hudi.apache.org/docs/configurations.html#insertSplitSize
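
Roughly, the knobs that govern initial file sizing fit together like this.
Treat it as an illustrative sketch: df and basePath are placeholders, and
the values are examples only, not recommendations.

  df.write.format("org.apache.hudi").
    // upper bound for a base parquet file
    option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
    // files below this size are treated as "small" and receive future inserts
    option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
    // records packed into each new file on the first insert (default 500000);
    // note that hoodie.copyonwrite.insert.auto.split, when enabled, derives
    // this from max file size / hoodie.copyonwrite.record.size.estimate instead
    option("hoodie.copyonwrite.insert.split.size", "500000").
    mode(Append).
    save(basePath)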

Thanks
Vinoth

On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> Below are a few of the files.
>
> -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> -rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
>
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> -rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
>
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>
>
> On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Team,
> >
> > I am using Hudi 0.5.0. While writing a COW table with the code below,
> > many small files of about 15 MB are getting created, whereas the total
> > partition size is 300 MB+.
> >
> >   val output = transDetailsDF.write.format("org.apache.hudi").
> >           option("hoodie.insert.shuffle.parallelism", "2").
> >           option("hoodie.upsert.shuffle.parallelism", "2").
> >           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >           option(OPERATION_OPT_KEY, "upsert").
> >           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >           option(TABLE_NAME, tableName).
> >           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >           option("hoodie.memory.merge.max.size", "2004857600000").
> >           option("hoodie.bloom.index.prune.by.ranges","false").
> >           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >           option("hoodie.cleaner.commits.retained", 2).
> >           option("hoodie.keep.min.commits",3).
> >           option("hoodie.keep.max.commits",5).
> >           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >           mode(Append).
> >           save(basePath);
> > As per the instructions provided in
> > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> > compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
> >
> > Hadoop block size is 256 MB; I am expecting 128 MB files to be
> > created.
> >
> > Am I missing any config here?
> >
> > Thanks,
> > Selva
> >
>

Re: Small Files

Posted by selvaraj periyasamy <se...@gmail.com>.
Below are a few of the files.

-rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
/projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
-rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
/projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
-rw-r--r--   3 dvcc Hadoop_cdp     15.2 M 2020-03-15 19:09
/projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
-rw-r--r--   3 dvcc Hadoop_cdp     15.1 M 2020-03-15 19:09
/projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet


On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:

> Team,
>
> I am using Hudi 0.5.0. While writing a COW table with the code below, many
> small files of about 15 MB are getting created, whereas the total partition
> size is 300 MB+.
>
>   val output = transDetailsDF.write.format("org.apache.hudi").
>           option("hoodie.insert.shuffle.parallelism", "2").
>           option("hoodie.upsert.shuffle.parallelism", "2").
>           option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>           option(OPERATION_OPT_KEY, "upsert").
>           option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>           option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>           option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>           option(TABLE_NAME, tableName).
>           option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>           option("hoodie.memory.merge.max.size", "2004857600000").
>           option("hoodie.bloom.index.prune.by.ranges","false").
>           option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>           option("hoodie.cleaner.commits.retained", 2).
>           option("hoodie.keep.min.commits",3).
>           option("hoodie.keep.max.commits",5).
>           option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>           option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>           mode(Append).
>           save(basePath);
> As per the instructions provided in
> https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
>
> Hadoop block size is 256 MB; I am expecting 128 MB files to be created.
>
> Am I missing any config here?
>
> Thanks,
> Selva
>