Posted to dev@hudi.apache.org by selvaraj periyasamy <se...@gmail.com> on 2020/03/15 19:16:35 UTC
Small Files
Team,
I am using Hudi 0.5.0. While writing a COW table with the code below, many
small files of about 15 MB each are getting created, whereas the total
partition size is 300 MB+.
val output = transDetailsDF.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "2").
option("hoodie.upsert.shuffle.parallelism", "2").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option(OPERATION_OPT_KEY, "upsert").
option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
option(RECORDKEY_FIELD_OPT_KEY,"record_key").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
option("hoodie.memory.merge.max.size", "2004857600000").
option("hoodie.bloom.index.prune.by.ranges","false").
option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
option("hoodie.cleaner.commits.retained", 2).
option("hoodie.keep.min.commits",3).
option("hoodie.keep.max.commits",5).
option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
mode(Append).
save(basePath);
As per the instructions provided in
https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
compactionSmallFileSize to 100 MB and limitFileSize to 128 MB.
The Hadoop block size is 256 MB, and I am expecting 128 MB files to be
created. Am I missing any config here?
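For reference, the relationship between these settings can be sketched as follows. This is a rough illustration only; the byte values come from the snippet above, and the ~300 MB partition size is the approximate figure reported, not a measured value.

```scala
// Hedged sketch: how the sizing configs above relate (Hudi 0.5.0 option names).
val maxFileSize    = 128L * 1024 * 1024 // hoodie.parquet.max.file.size
val smallFileLimit = 100L * 1024 * 1024 // hoodie.parquet.small.file.limit:
                                        // files below this are candidates
                                        // for expansion on later commits
val partitionBytes = 300L * 1024 * 1024 // observed partition size (approximate)

// Ideal outcome with these settings: roughly 3 files of ~100-128 MB each,
// rather than ~20 files of ~15 MB.
val idealFileCount = math.ceil(partitionBytes.toDouble / maxFileSize).toInt
```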
Thanks,
Selva
Re: Small Files
Posted by Vinoth Chandar <vi...@apache.org>.
Hi Selva,
Hudi has a CLI that summarizes each commit nicely. Can you also
provide the output from that? It will tell you how many files are
created/updated, etc.:
http://hudi.apache.org/docs/deployment.html#inspecting-commits
The 2765125 records in the initial batch get split into 2.7M/500K ≈ 6
insert buckets during writing (for parallel write performance), as per the
config I pointed out before. However, that is not as high as 20, the number
of files you are getting. Can you share the driver logs around the
statement below for the initial commit (HoodieCopyOnWrite#UpsertPartitioner
is what we want)? We can open a GitHub issue if it makes it easier to
share logs/code etc..
LOG.info("Total Buckets :" + totalBuckets + ", buckets info => " +
bucketInfoMap + ", \n"
+ "Partition to insert buckets => " + partitionPathToInsertBuckets + ", \n"
+ "UpdateLocations mapped to buckets =>" + updateLocationToBucket);
Aside from that, I sampled a file id in later commits, and it does seem like
it's getting rewritten as expected.. So if we understand why you have 20
files to begin with, we can go from there.
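For context, the insert-bucket arithmetic referred to above works out roughly like this. This is a sketch assuming the Hudi 0.5.0 default insert split size of 500,000 records (hoodie.copyonwrite.insert.split.size); the actual partitioner also factors in a record-size estimate, which this omits.

```scala
// Rough expected bucket count for the initial insert, assuming the
// 500,000-record default for hoodie.copyonwrite.insert.split.size.
val totalRecords    = 2765125L
val insertSplitSize = 500000L
val expectedBuckets = math.ceil(totalRecords.toDouble / insertSplitSize).toInt
// expectedBuckets comes out around 6, well below the ~20 files actually
// observed, which is why the UpsertPartitioner log output is needed.
```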
On Mon, Mar 16, 2020 at 12:48 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:
> And then I ran updates of 2000 records 4 times, and below are the
> files.
>
> transDetailsDF1.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "2").
> option("hoodie.upsert.shuffle.parallelism", "2").
> option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> option(OPERATION_OPT_KEY, "upsert").
> option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> option(TABLE_NAME, tableName).
>
>
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> option("hoodie.memory.merge.max.size", "2004857600000").
> option("hoodie.bloom.index.prune.by.ranges","false").
> option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> option("hoodie.cleaner.commits.retained",1).
> option("hoodie.keep.min.commits",2).
> option("hoodie.keep.max.commits",3).
>
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> option("hoodie.copyonwrite.insert.split.size","2650000").
> mode(Append).
> save(basePath);
>
> Found 67 items
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
>
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
>
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet
>
> Thanks,
> Selva
>
> On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Hi Vinoth,
> >
> > I tried multiple runs. The total number of records expected in the
> > partition is 2765125. Below is the spark-shell command.
> >
> > spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> > 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
> > yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> > --executor-memory 40g --num-executors 5 --executor-cores 5 --conf
> > 'spark.executor.memoryOverhead=2048' --conf
> > 'spark.dynamicAllocation.enabled=false' --conf
> > 'spark.sql.hive.convertMetastoreParquet=false' --conf
> > 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m'
> > --conf 'spark.shuffle.service.enabled=true'
> >
> > Dynamic allocation is set to false.
> >
> > Attempt 1 -> Tried with mode Overwrite and OPERATION_OPT_KEY set to
> > insert. Below is the code.
> >
> > transDetailsDF1.write.format("org.apache.hudi").
> >
> > option("hoodie.insert.shuffle.parallelism", "5").
> >
> > option("hoodie.upsert.shuffle.parallelism", "5").
> >
> > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >
> > option(OPERATION_OPT_KEY, "insert").
> >
> > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >
> > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >
> > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >
> > option(TABLE_NAME, tableName).
> >
> >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >
> > option("hoodie.memory.merge.max.size", "2004857600000").
> >
> > option("hoodie.bloom.index.prune.by.ranges","false").
> >
> > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >
> > option("hoodie.cleaner.commits.retained",1).
> >
> > option("hoodie.keep.min.commits",2).
> >
> > option("hoodie.keep.max.commits",3).
> >
> >
> >
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >
> >
> >
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >
> > option("hoodie.copyonwrite.insert.split.size","2650000").
> >
> > mode(Overwrite).
> >
> > save(basePath);
> >
> >
> >
> >
> > Below are the files in HDFS .
> >
> > Found 23 items
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> >
> >
> >
> > Attempt 2 -> Updated 10 records with Append mode and the upsert operation.
> >
> >
> > transDetailsDF1.write.format("org.apache.hudi").
> >
> > option("hoodie.insert.shuffle.parallelism", "5").
> >
> > option("hoodie.upsert.shuffle.parallelism", "5").
> >
> > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >
> > option(OPERATION_OPT_KEY, "upsert").
> >
> > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >
> > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >
> > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >
> > option(TABLE_NAME, tableName).
> >
> >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >
> > option("hoodie.memory.merge.max.size", "2004857600000").
> >
> > option("hoodie.bloom.index.prune.by.ranges","false").
> >
> > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >
> > option("hoodie.cleaner.commits.retained",1).
> >
> > option("hoodie.keep.min.commits",2).
> >
> > option("hoodie.keep.max.commits",3).
> >
> >
> >
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >
> >
> >
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >
> > option("hoodie.copyonwrite.insert.split.size","2650000").
> >
> > mode(Append).
> >
> > save(basePath);
> >
> >
> >
> > Found 31 items
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
> >
> > -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> >
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
> >
> >
> >
> >
> >
> >
> >
> > In both cases, the file sizes are around 15 MB.
> >
> >
> > Thanks,
> >
> > Selva
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org>
> wrote:
> >
> >> Hi Selva,
> >>
> >> Was this the first insert? Hudi handles small files by converting some
> >> inserts into updates to existing files. In this case, I see just one
> >> commit time, so there is nothing Hudi could optimize for.
> >> If you continue making updates/inserts over time, you should see these
> >> four files being expanded up to the configured limits, instead of new
> >> files being created..
> >>
> >> Let me know if that helps.. Also, another config to pay attention to
> >> for the first batch of inserts is
> >> http://hudi.apache.org/docs/configurations.html#insertSplitSize
> >>
> >> Thanks
> >> VInoth
> >>
> >> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
> >> selvaraj.periyasamy1983@gmail.com> wrote:
> >>
> >> > Below are the few files.
> >> >
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> >> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >> >
> >> >
> >>
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
> >> >
> >> >
> >> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> >> > selvaraj.periyasamy1983@gmail.com> wrote:
> >> >
> >> > > Team,
> >> > >
> >> > > I am using Hudi 0.5.0. While writing a COW table with the below
> >> > > code, many small files of ~15 MB are getting created, whereas the
> >> > > total partition size is 300 MB+.
> >> > >
> >> > > val output = transDetailsDF.write.format("org.apache.hudi").
> >> > > option("hoodie.insert.shuffle.parallelism", "2").
> >> > > option("hoodie.upsert.shuffle.parallelism", "2").
> >> > >
> >> option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> >> > > option(OPERATION_OPT_KEY, "upsert").
> >> > > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> >> > > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> >> > > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> >> > > option(TABLE_NAME, tableName).
> >> > >
> >> > >
> >> >
> >>
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> >> > > option("hoodie.memory.merge.max.size", "2004857600000").
> >> > > option("hoodie.bloom.index.prune.by.ranges","false").
> >> > >
> option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> >> > > option("hoodie.cleaner.commits.retained", 2).
> >> > > option("hoodie.keep.min.commits",3).
> >> > > option("hoodie.keep.max.commits",5).
> >> > >
> >> > >
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >> > >
> >> > >
> >> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> >> > > mode(Append).
> >> > > save(basePath);
> >> > > As per instruction provided in
> >> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> >> > compactionSmallFileSize
> >> > > to 100 MB and limitFileSize to 128 MB.
> >> > >
> >> > > Hadoop block size is 256 MB; I am expecting 128 MB files to be
> >> created.
> >> > >
> >> > > Am I missing any config here?
> >> > >
> >> > > Thanks,
> >> > > Selva
> >> > >
> >> >
> >>
> >
>
Re: Small Files
Posted by selvaraj periyasamy <se...@gmail.com>.
I then ran updates of 2,000 records four times; below are the resulting
files.
transDetailsDF1.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option(OPERATION_OPT_KEY, "upsert").
  option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
  option(RECORDKEY_FIELD_OPT_KEY, "record_key").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  option("hoodie.datasource.write.payload.class", "org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
  option("hoodie.memory.merge.max.size", "2004857600000").
  option("hoodie.bloom.index.prune.by.ranges", "false").
  option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
  option("hoodie.cleaner.commits.retained", 1).
  option("hoodie.keep.min.commits", 2).
  option("hoodie.keep.max.commits", 3).
  option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
  option("hoodie.copyonwrite.insert.split.size", "2650000").
  mode(Append).
  save(basePath)
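A likely reason each base file lands near 15 MB rather than 128 MB: on the very first insert, Hudi has no commit history to learn the real record size from, so (with insert auto-split on, its default) it bin-packs inserts using hoodie.copyonwrite.record.size.estimate against hoodie.parquet.max.file.size, and subsequent pure updates only rewrite the same file groups without growing them. A rough back-of-the-envelope sketch — the 1024-byte default for the estimate is an assumption here, not something stated in this thread:

```python
import math

# Assumed Hudi 0.5.x values (hedged):
#   hoodie.parquet.max.file.size            = 128 MB (set explicitly above)
#   hoodie.copyonwrite.record.size.estimate = 1024 bytes (assumed default)
max_file_size = 128 * 1024 * 1024
record_size_estimate = 1024
total_records = 2_765_125  # partition record count reported in this thread

# Records the insert planner packs into one new base file.
records_per_file = max_file_size // record_size_estimate  # 131072

# Number of base files the initial insert is split into.
num_files = math.ceil(total_records / records_per_file)
print(num_files)  # 22, matching the 22 parquet files per commit listed below

# If each file actually lands at ~15 MB, the true on-disk record size is far
# below the 1 KB estimate:
print(15 * 1024 * 1024 // records_per_file)  # ~120 bytes per record
```

If this is what is happening, follow-up *insert* batches (not pure updates) should let Hudi learn the real average record size from commit metadata and route new inserts into the small files, growing them toward 128 MB.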
Found 67 items
-rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-116-663_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-116-667_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-116-661_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-116-670_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-116-666_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-116-662_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-116-668_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-116-669_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-116-665_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-116-664_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-116-677_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-116-675_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-116-674_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-116-672_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-116-676_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-116-680_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-116-671_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-116-679_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-116-678_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-116-673_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-116-682_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:42
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-116-681_20200316074213.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_4-147-831_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-147-833_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-147-827_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-147-829_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-147-839_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-147-838_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-147-847_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-147-843_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:43
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-147-846_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-147-842_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-147-841_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-147-837_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-147-845_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-147-830_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-147-848_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-147-840_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-147-836_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-147-828_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-147-834_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_5-147-832_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:44
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-147-844_20200316074336.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-178-993_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_2-178-995_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_4-178-997_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-178-999_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_7-178-1000_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_9-178-1002_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_5-178-998_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_3-178-996_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-178-994_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_21-178-1014_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_12-178-1005_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_13-178-1006_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_14-178-1007_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_11-178-1004_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_10-178-1003_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_18-178-1011_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_17-178-1010_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_19-178-1012_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_16-178-1009_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_15-178-1008_20200316074511.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:45
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_20-178-1013_20200316074511.parquet
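An aside on the count above: under the KEEP_LATEST_FILE_VERSIONS cleaner policy, the knob that governs retention is hoodie.cleaner.fileversions.retained (default 3 in 0.5.x) rather than hoodie.cleaner.commits.retained. That is a hedged reading, but it matches the arithmetic of the listing:

```python
# Hedged sketch: KEEP_LATEST_FILE_VERSIONS is assumed to keep
# hoodie.cleaner.fileversions.retained (default 3) versions per file group,
# independent of hoodie.cleaner.commits.retained.
file_groups = 22          # distinct file IDs (UUID prefixes) in this partition
versions_retained = 3     # assumed hoodie.cleaner.fileversions.retained
partition_metadata = 1    # the .hoodie_partition_metadata file

total_items = file_groups * versions_retained + partition_metadata
print(total_items)  # 67, matching "Found 67 items" above
```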
Thanks,
Selva
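Listings like the above are easier to audit by parsing the base-file names, which encode a file ID, a write token, and the commit instant time. A small illustrative helper (not part of Hudi; the naming pattern is inferred from the listings in this thread):

```python
import re
from collections import defaultdict

# Hudi COW base-file names, as seen above, look like:
#   <fileId>_<writeToken>_<commitInstant>.parquet
PATTERN = re.compile(r"^(?P<file_id>.+?)_(?P<token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet$")

def group_versions(names):
    """Map each file ID to the commit instants it has base files for."""
    groups = defaultdict(list)
    for n in names:
        m = PATTERN.match(n)
        if m:
            groups[m.group("file_id")].append(m.group("instant"))
    return dict(groups)

names = [
    "ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-147-835_20200316074336.parquet",
    "ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_8-178-1001_20200316074511.parquet",
]
print(group_versions(names))
# {'ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0': ['20200316074336', '20200316074511']}
```

Counting instants per file ID this way makes it easy to see how many versions of each file group the cleaner has retained.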
On Mon, Mar 16, 2020 at 12:32 AM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:
> Hi Vinoth,
>
> I tried multiple runs. The total number of records expected in the
> partition is 2765125. Below is the spark-shell command.
>
> spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
> 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
> yarn --deploy-mode client --queue cybslarge --driver-memory 4g
> --executor-memory 40g --num-executors 5 --executor-cores 5 --conf
> 'spark.executor.memoryOverhead=2048' --conf
> 'spark.dynamicAllocation.enabled=false' --conf
> 'spark.sql.hive.convertMetastoreParquet=false' --conf
> 'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m' --conf
> 'spark.shuffle.service.enabled=true'
>
> Dynamic allocation set to false
>
> Attempt 1 -> ran with mode Overwrite and OPERATION_OPT_KEY "insert". Below
> is the code.
>
> transDetailsDF1.write.format("org.apache.hudi").
>   option("hoodie.insert.shuffle.parallelism", "5").
>   option("hoodie.upsert.shuffle.parallelism", "5").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option(OPERATION_OPT_KEY, "insert").
>   option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
>   option(RECORDKEY_FIELD_OPT_KEY, "record_key").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   option("hoodie.datasource.write.payload.class", "org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>   option("hoodie.memory.merge.max.size", "2004857600000").
>   option("hoodie.bloom.index.prune.by.ranges", "false").
>   option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
>   option("hoodie.cleaner.commits.retained", 1).
>   option("hoodie.keep.min.commits", 2).
>   option("hoodie.keep.max.commits", 3).
>   option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
>   option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
>   option("hoodie.copyonwrite.insert.split.size", "2650000").
>   mode(Overwrite).
>   save(basePath)
>
>
>
>
> Below are the files in HDFS.
>
> Found 23 items
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
>
>
>
> Attempt 2 -> updated 10 records with mode Append and operation "upsert".
>
>
> transDetailsDF1.write.format("org.apache.hudi").
>   option("hoodie.insert.shuffle.parallelism", "5").
>   option("hoodie.upsert.shuffle.parallelism", "5").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option(OPERATION_OPT_KEY, "upsert").
>   option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
>   option(RECORDKEY_FIELD_OPT_KEY, "record_key").
>   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>   option(TABLE_NAME, tableName).
>   option("hoodie.datasource.write.payload.class", "org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>   option("hoodie.memory.merge.max.size", "2004857600000").
>   option("hoodie.bloom.index.prune.by.ranges", "false").
>   option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").
>   option("hoodie.cleaner.commits.retained", 1).
>   option("hoodie.keep.min.commits", 2).
>   option("hoodie.keep.max.commits", 3).
>   option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024)).
>   option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
>   option("hoodie.copyonwrite.insert.split.size", "2650000").
>   mode(Append).
>   save(basePath)
>
>
>
> Found 31 items
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
>
> -rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
> /projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
>
> In both cases, file sizes are around 15 MB.
>
>
> Thanks,
>
> Selva
>
> On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org> wrote:
>
>> Hi Selva,
>>
>> Was this the first insert? Hudi handles small files by converting some
>> inserts into updates to existing files. In this case, I see just one
>> commit time, so there is nothing Hudi could optimize for.
>> If you continue making updates/inserts over time, you should see these
>> four files being expanded up to the configured limits, instead of new
>> files being created.
>>
>> Let me know if that helps. Another config to pay attention to, in
>> case of the first batch of inserts is
>> http://hudi.apache.org/docs/configurations.html#insertSplitSize
>>
>> Thanks
>> VInoth
>>
>> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
>> selvaraj.periyasamy1983@gmail.com> wrote:
>>
>> > Below are the few files.
>> >
>> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
>> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
>> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
>> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
>> >
>> >
>> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>> >
>> >
>> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
>> > selvaraj.periyasamy1983@gmail.com> wrote:
>> >
>> > > Team,
>> > >
>> > > I am using Hudi 0.5.0. While writing a COW table with the below
>> > > code, many small files of ~15 MB are getting created, whereas the
>> > > total partition size is 300 MB+.
>> > >
>> > > val output = transDetailsDF.write.format("org.apache.hudi").
>> > > option("hoodie.insert.shuffle.parallelism", "2").
>> > > option("hoodie.upsert.shuffle.parallelism", "2").
>> > >
>> option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
>> > > option(OPERATION_OPT_KEY, "upsert").
>> > > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
>> > > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
>> > > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>> > > option(TABLE_NAME, tableName).
>> > >
>> > >
>> >
>> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
>> > > option("hoodie.memory.merge.max.size", "2004857600000").
>> > > option("hoodie.bloom.index.prune.by.ranges","false").
>> > > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
>> > > option("hoodie.cleaner.commits.retained", 2).
>> > > option("hoodie.keep.min.commits",3).
>> > > option("hoodie.keep.max.commits",5).
>> > >
>> > > option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>> > >
>> > >
>> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
>> > > mode(Append).
>> > > save(basePath);
>> > > As per instruction provided in
>> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
>> > compactionSmallFileSize
>> > > to 100 MB and limitFileSize to 128 .
>> > >
>> > > Hadoop block size is 256 MB , I am looking for 128 MB files are
>> created.
>> > >
>> > > Am I missing any config here?
>> > >
>> > > Thanks,
>> > > Selva
>> > >
>> >
>>
>
Re: Small Files
Posted by selvaraj periyasamy <se...@gmail.com>.
Hi Vinoth,
I tried multiple runs. The total number of records expected in the
partition is 2,765,125. Below is the spark-shell command.
spark2-shell --jars hudi-spark-bundle-0.5.0-incubating.jar --conf
'spark.serializer=org.apache.spark.serializer.KryoSerializer' --master
yarn --deploy-mode client --queue cybslarge --driver-memory 4g
--executor-memory 40g --num-executors 5 --executor-cores 5 --conf
'spark.executor.memoryOverhead=2048' --conf
'spark.dynamicAllocation.enabled=false' --conf
'spark.sql.hive.convertMetastoreParquet=false' --conf
'spark.rdd.compress=true' --conf 'spark.kryoserializer.buffer.max=512m' --conf
'spark.shuffle.service.enabled=true'
Dynamic allocation is set to false.
Attempt 1 -> ran with mode Overwrite and OPERATION_OPT_KEY set to insert. Below
is the code.
transDetailsDF1.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "5").
option("hoodie.upsert.shuffle.parallelism", "5").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option(OPERATION_OPT_KEY, "insert").
option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
option(RECORDKEY_FIELD_OPT_KEY,"record_key").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
option("hoodie.memory.merge.max.size", "2004857600000").
option("hoodie.bloom.index.prune.by.ranges","false").
option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
option("hoodie.cleaner.commits.retained",1).
option("hoodie.keep.min.commits",2).
option("hoodie.keep.max.commits",3).
option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
option("hoodie.copyonwrite.insert.split.size","2650000").
mode(Overwrite).
save(basePath);
Below are the files in HDFS.
Found 23 items
-rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
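[Editorial aside: the 22 parquet files above (23 items minus the partition
metadata file) line up with Hudi's first-commit sizing math, assuming 0.5.0's
default per-record size estimate of 1024 bytes — the config name
`hoodie.copyonwrite.record.size.estimate` and its default are assumptions to
verify against your version's configuration docs. A sketch:]

```scala
// With no commit history to learn from, Hudi's first commit assumes each
// incoming record is ~1024 bytes (hoodie.copyonwrite.record.size.estimate
// default -- an assumption to verify) and packs insert buckets against the
// configured max file size.
val records = 2765125L               // rows written to the partition
val assumedRecordBytes = 1024L       // Hudi's assumed first-commit estimate
val maxFileSize = 128L * 1024 * 1024 // hoodie.parquet.max.file.size
val buckets =
  math.ceil(records * assumedRecordBytes.toDouble / maxFileSize).toInt
println(buckets)                     // 22 buckets -> 22 parquet files
// ~125k records land in each bucket; at the table's true ~125 bytes/record,
// each file comes out near 15 MB instead of the targeted 128 MB.
```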
Attempt 2 -> updated 10 records with mode Append and OPERATION_OPT_KEY set to upsert.
transDetailsDF1.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "5").
option("hoodie.upsert.shuffle.parallelism", "5").
option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
option(OPERATION_OPT_KEY, "upsert").
option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
option(RECORDKEY_FIELD_OPT_KEY,"record_key").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
option("hoodie.memory.merge.max.size", "2004857600000").
option("hoodie.bloom.index.prune.by.ranges","false").
option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
option("hoodie.cleaner.commits.retained",1).
option("hoodie.keep.min.commits",2).
option("hoodie.keep.max.commits",3).
option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
option("hoodie.copyonwrite.insert.split.size","2650000").
mode(Append).
save(basePath);
Found 31 items
-rw-r--r-- 3 svchdc110p Hadoop_cdp 93 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/.hoodie_partition_metadata
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_1-95-1392_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/2fbeb924-8c1f-4430-ba89-441d49001f37-0_7-95-1398_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_13-95-1404_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f600be63-62d3-4d5b-947b-34d965dfff2a-0_4-95-1395_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/4dbd2a39-3fb0-4a38-ad45-e0918030e99d-0_11-95-1402_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29a94502-5336-4d11-9914-ee61761bc7ba-0_5-95-1396_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_2-95-1393_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/f069f68b-e697-40d2-a8d9-5ac8022a95c1-0_14-95-1405_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/04b1a6dd-acf1-4660-8890-24595b2824be-0_10-95-1401_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/8b7cf906-1c13-4c20-8a39-dbdf9ca473a2-0_8-95-1399_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b9b16103-0b86-45ba-8a09-929835876a68-0_3-95-1394_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/e6158d52-aa16-411a-99d6-8ca5c98ae9cd-0_6-95-1397_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_9-95-1400_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 15.0 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/bd082770-8013-4749-8825-7004b4e88d93-0_0-95-1391_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a98de531-2581-4d91-b3ba-189d758a06f9-0_12-95-1403_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9d875b04-1536-4b5d-bdd5-4d301019ca67-0_17-95-1408_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_21-95-1412_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/9e021dd4-3bf0-44ce-b002-23e10f39d7d0-0_15-95-1406_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.8 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/b424d41e-59d4-4561-9737-7fbbcfe8979c-0_19-95-1410_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_16-95-1407_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_20-95-1411_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:12
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_18-95-1409_20200316070822.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/29f8f9dc-f42a-4400-8d52-3d8229990b26-0_0-121-1585_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/e3d09802-d3bd-4f70-9076-753547d46c2c-0_6-121-1588_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/b09c041d-1b04-419c-9cbe-ff2394656086-0_4-121-1584_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/a4f56ae8-5a13-41a9-96ba-aa19bd6bb943-0_2-121-1582_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/c7e1b74a-91ad-4763-a477-9bc3e32626ce-0_3-121-1587_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/ee448e5b-e4f0-4d7a-bcce-b8022489acbd-0_5-121-1586_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/9e3b34b1-431e-497f-80ca-ba8bf3369142-0_1-121-1581_20200316071437.parquet
-rw-r--r-- 3 svchdc110p Hadoop_cdp 14.9 M 2020-03-16 07:14
/projects/transaction_details_hourly_hudi/20191201/11/b414d812-cf54-4e88-a2ce-2557a0ee980b-0_7-121-1583_20200316071437.parquet
In both cases, the file sizes are around 15 MB.
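[Editorial aside: one hedged way to attack the 15 MB first-commit sizing is to
tell Hudi how wide the records actually are, since the bucket math otherwise
falls back to a default estimate. A sketch only, reusing the dataframe above;
the config name `hoodie.copyonwrite.record.size.estimate` is an assumption to
verify against the configurations page for your version:]

```scala
// Sketch: same writer as above, plus an explicit per-record size estimate so
// the first commit's insert buckets are sized from realistic record widths.
// transDetailsDF1, tableName, and basePath are the values defined earlier.
transDetailsDF1.write.format("org.apache.hudi").
  option(OPERATION_OPT_KEY, "insert").
  option(PRECOMBINE_FIELD_OPT_KEY, "transaction_date").
  option(RECORDKEY_FIELD_OPT_KEY, "record_key").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  option("hoodie.parquet.max.file.size", String.valueOf(128*1024*1024)).
  option("hoodie.parquet.small.file.limit", String.valueOf(100*1024*1024)).
  // ~125 bytes/record observed above; config name assumed, verify per version.
  option("hoodie.copyonwrite.record.size.estimate", "125").
  mode(Overwrite).
  save(basePath)
```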
Thanks,
Selva
On Sun, Mar 15, 2020 at 11:16 PM Vinoth Chandar <vi...@apache.org> wrote:
> Hi Selva,
>
> Was this the first insert? Hudi handles small files by converting some
> inserts as updates to existing files. In this case, I see just one commit
> time, so there is nothing Hudi could optimize for.
> If you continue making updates/inserts over time, you should see these four
> files being expanded upto the configured limits, instead of new files being
> created..
>
> Let me know if that helps.. Also another config to pay attention to, in
> case of the first batch of inserts is
> http://hudi.apache.org/docs/configurations.html#insertSplitSize
>
> Thanks
> VInoth
>
> On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Below are the few files.
> >
> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> > -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
> >
> >
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
> >
> >
> > On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> > selvaraj.periyasamy1983@gmail.com> wrote:
> >
> > > Team,
> > >
> > > I am using Hudi 0.5.0. While writing COW table with below code, many
> > small
> > > files with 15 MB size are getting created, where as total partition
> size
> > is
> > > 300MB +
> > >
> > > val output = transDetailsDF.write.format("org.apache.hudi").
> > > option("hoodie.insert.shuffle.parallelism", "2").
> > > option("hoodie.upsert.shuffle.parallelism", "2").
> > > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> > > option(OPERATION_OPT_KEY, "upsert").
> > > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> > > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> > > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> > > option(TABLE_NAME, tableName).
> > >
> > >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> > > option("hoodie.memory.merge.max.size", "2004857600000").
> > > option("hoodie.bloom.index.prune.by.ranges","false").
> > > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> > > option("hoodie.cleaner.commits.retained", 2).
> > > option("hoodie.keep.min.commits",3).
> > > option("hoodie.keep.max.commits",5).
> > >
> > > option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> > >
> > >
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> > > mode(Append).
> > > save(basePath);
> > > As per instruction provided in
> > > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> > compactionSmallFileSize
> > > to 100 MB and limitFileSize to 128 .
> > >
> > > Hadoop block size is 256 MB , I am looking for 128 MB files are
> created.
> > >
> > > Am I missing any config here?
> > >
> > > Thanks,
> > > Selva
> > >
> >
>
Re: Small Files
Posted by Vinoth Chandar <vi...@apache.org>.
Hi Selva,
Was this the first insert? Hudi handles small files by converting some
inserts as updates to existing files. In this case, I see just one commit
time, so there is nothing Hudi could optimize for.
If you continue making updates/inserts over time, you should see these four
files being expanded up to the configured limits, instead of new files being
created.
Let me know if that helps.. Also another config to pay attention to, in
case of the first batch of inserts is
http://hudi.apache.org/docs/configurations.html#insertSplitSize
Thanks
Vinoth
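[Editorial aside: a sketch of how the insertSplitSize knob from the link above
could be wired into the writer. The companion auto-split flag name is an
assumption taken from the configurations page and should be verified for
0.5.0:]

```scala
// Fragment to add to the existing write options: insertSplitSize bounds the
// number of records per insert bucket on early commits. It is honored when
// auto-tuning of splits is off; the flag name below is assumed, verify it.
.option("hoodie.copyonwrite.insert.split.size", "500000").
option("hoodie.copyonwrite.insert.auto.split", "false")
```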
On Sun, Mar 15, 2020 at 12:19 PM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:
> Below are the few files.
>
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
>
> /projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
>
> /projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
>
> /projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
> -rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
>
> /projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
>
>
> On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
> selvaraj.periyasamy1983@gmail.com> wrote:
>
> > Team,
> >
> > I am using Hudi 0.5.0. While writing COW table with below code, many
> small
> > files with 15 MB size are getting created, where as total partition size
> is
> > 300MB +
> >
> > val output = transDetailsDF.write.format("org.apache.hudi").
> > option("hoodie.insert.shuffle.parallelism", "2").
> > option("hoodie.upsert.shuffle.parallelism", "2").
> > option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> > option(OPERATION_OPT_KEY, "upsert").
> > option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> > option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> > option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> > option(TABLE_NAME, tableName).
> >
> >
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> > option("hoodie.memory.merge.max.size", "2004857600000").
> > option("hoodie.bloom.index.prune.by.ranges","false").
> > option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> > option("hoodie.cleaner.commits.retained", 2).
> > option("hoodie.keep.min.commits",3).
> > option("hoodie.keep.max.commits",5).
> >
> > option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
> >
> > option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> > mode(Append).
> > save(basePath);
> > As per instruction provided in
> > https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set
> compactionSmallFileSize
> > to 100 MB and limitFileSize to 128 .
> >
> > Hadoop block size is 256 MB , I am looking for 128 MB files are created.
> >
> > Am I missing any config here?
> >
> > Thanks,
> > Selva
> >
>
Re: Small Files
Posted by selvaraj periyasamy <se...@gmail.com>.
Below are a few of the files.
-rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
/projects/20191201/10/da5d5747-91cb-4fd4-bd2a-1881cae8b1ba-0_12-253-3275_20200315190853.parquet
-rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
/projects/20191201/10/8b111872-f797-4a24-990c-8854b7dcaf48-0_11-253-3274_20200315190853.parquet
-rw-r--r-- 3 dvcc Hadoop_cdp 15.2 M 2020-03-15 19:09
/projects/20191201/10/84b6aeb1-6c05-4a80-bf05-29256bbe03a7-0_17-253-3280_20200315190853.parquet
-rw-r--r-- 3 dvcc Hadoop_cdp 15.1 M 2020-03-15 19:09
/projects/20191201/10/2fd64689-aa67-4727-ac47-262680aad570-0_14-253-3277_20200315190853.parquet
On Sun, Mar 15, 2020 at 12:16 PM selvaraj periyasamy <
selvaraj.periyasamy1983@gmail.com> wrote:
> Team,
>
> I am using Hudi 0.5.0. While writing COW table with below code, many small
> files with 15 MB size are getting created, where as total partition size is
> 300MB +
>
> val output = transDetailsDF.write.format("org.apache.hudi").
> option("hoodie.insert.shuffle.parallelism", "2").
> option("hoodie.upsert.shuffle.parallelism", "2").
> option("hoodie.datasource.write.table.type","COPY_ON_WRITE").
> option(OPERATION_OPT_KEY, "upsert").
> option(PRECOMBINE_FIELD_OPT_KEY,"transaction_date").
> option(RECORDKEY_FIELD_OPT_KEY,"record_key").
> option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> option(TABLE_NAME, tableName).
>
> option("hoodie.datasource.write.payload.class","org.apache.hudi.OverwriteWithLatestAvroPayload_Custom").
> option("hoodie.memory.merge.max.size", "2004857600000").
> option("hoodie.bloom.index.prune.by.ranges","false").
> option("hoodie.cleaner.policy","KEEP_LATEST_FILE_VERSIONS").
> option("hoodie.cleaner.commits.retained", 2).
> option("hoodie.keep.min.commits",3).
> option("hoodie.keep.max.commits",5).
>
> option("hoodie.parquet.max.file.size",String.valueOf(128*1024*1024)).
>
> option("hoodie.parquet.small.file.limit",String.valueOf(100*1024*1024)).
> mode(Append).
> save(basePath);
> As per instruction provided in
> https://cwiki.apache.org/confluence/display/HUDI/FAQ , I set compactionSmallFileSize
> to 100 MB and limitFileSize to 128 .
>
> Hadoop block size is 256 MB , I am looking for 128 MB files are created.
>
> Am I missing any config here?
>
> Thanks,
> Selva
>