Posted to dev@hudi.apache.org by ra...@gmail.com, ra...@gmail.com on 2019/04/03 13:27:00 UTC

Re: how to merge small parquet files in the hudi location


On 2019/03/13 12:57:59, rahuledavalath@gmail.com <ra...@gmail.com> wrote: 
> 
> 
> On 2019/03/13 08:42:13, Vinoth Chandar <vi...@apache.org> wrote: 
> > Hi Rahul,
> > 
> > Good to know. Yes for copy_on_write please turn off inline compaction.
> > (Probably explains why the default was false).
> > 
> > Thanks
> > Vinoth
> > 
> > On Wed, Mar 13, 2019 at 12:51 AM rahuledavalath@gmail.com <
> > rahuledavalath@gmail.com> wrote:
> > 
> > >
> > >
> > > On 2019/03/12 23:04:43, Vinoth Chandar <vi...@apache.org> wrote:
> > > > Opened up https://github.com/uber/hudi/pull/599/files to improve this
> > > > out-of-box
> > > >
> > > > On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar <vi...@apache.org>
> > > wrote:
> > > >
> > > > > Hi Rahul,
> > > > >
> > > > > The files you shared all belong to the same file group (they share the same
> > > > > prefix, if you notice; see
> > > > > https://hudi.apache.org/concepts.html#terminologies).
> > > > > Given that it's not creating new file groups every run, the feature is
> > > > > kicking in.
> > > > >
> > > > > During each insert, Hudi will find the latest file in each file group (i.e.
> > > > > the one with the largest instant time/timestamp) and rewrite/expand that
> > > > > with the new inserts. Hudi does not clean up the old files immediately,
> > > > > since that could cause running queries to fail, because they could have
> > > > > started even hours ago (e.g. Hive).
> > > > >
> > > > > If you want to reduce the number of files you see, you can lower the
> > > > > number of commits retained:
> > > > > https://hudi.apache.org/configurations.html#retainCommits
> > > > > We retain 24 by default, i.e. after the 25th file, the first one will be
> > > > > automatically cleaned.
> > > > >
> > > > > Does that make sense? Are you able to query this data and find the
> > > > > expected records?
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Tue, Mar 12, 2019 at 12:23 PM rahuledavalath@gmail.com <
> > > > > rahuledavalath@gmail.com> wrote:
> > > > >
> > > > >>
> > > > >>
> > > > >> On 2019/03/11 18:25:46, Vinoth Chandar <vi...@apache.org> wrote:
> > > > >> > Hi Rahul,
> > > > >> >
> > > > >> > Hudi/Copy-on-write storage would keep expanding your existing
> > > parquet
> > > > >> files
> > > > >> > to reach the configured file size, once you set the small file size
> > > > >> > config..
> > > > >> >
> > > > >> > For example, we at Uber write 1GB files this way. To do that, you could
> > > > >> > set something like this:
> > > > >> > http://hudi.apache.org/configurations.html#limitFileSize = 1 * 1024 * 1024 * 1024
> > > > >> > http://hudi.apache.org/configurations.html#compactionSmallFileSize = 900 * 1024 * 1024
> > > > >> >
> > > > >> >
> > > > >> > Please let me know if you have trouble achieving this. Also, please use
> > > > >> > the insert operation (not bulk_insert) for this to work.
> > > > >> >
> > > > >> >
> > > > >> > Thanks
> > > > >> > Vinoth
> > > > >> >
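
Put together, a minimal sketch of these two suggestions in the DeltaStreamer property
file could look like the following (this uses the hoodie.parquet.* key names that appear
later in this thread; the values are simply the 1 GB and 900 MB figures above):

# sketch only: target ~1 GB parquet files (1 * 1024 * 1024 * 1024)
hoodie.parquet.max.file.size=1073741824
# treat files under ~900 MB (900 * 1024 * 1024) as small files to be expanded
hoodie.parquet.small.file.limit=943718400
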
> > > > >> > On Mon, Mar 11, 2019 at 12:32 AM rahuledavalath@gmail.com <
> > > > >> > rahuledavalath@gmail.com> wrote:
> > > > >> >
> > > > >> > >
> > > > >> > >
> > > > >> > > On 2019/03/08 13:43:52, Vinoth Chandar <vi...@apache.org> wrote:
> > > > >> > > > Hi Rahul,
> > > > >> > > >
> > > > >> > > > you can try adding hoodie.parquet.small.file.limit=104857600 to your
> > > > >> > > > property file to specify 100MB files. Note that this works only if you
> > > > >> > > > are using the insert (not bulk_insert) operation. Hudi will enforce file
> > > > >> > > > sizing at ingest time. As of now, there is no support for collapsing these
> > > > >> > > > file groups (parquet + related log files) into a large file group
> > > > >> > > > (HIP/design may come soon). Does that help?
> > > > >> > > >
> > > > >> > > > Also, on compaction in general: since you don't have any updates,
> > > > >> > > > I think you can simply use the copy_on_write storage. Inserts will go to
> > > > >> > > > the parquet file anyway on MOR (but if you'd like to be able to deal with
> > > > >> > > > updates later, I understand where you are going).
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Vinoth
> > > > >> > > >
> > > > >> > > > On Fri, Mar 8, 2019 at 3:25 AM rahuledavalath@gmail.com <
> > > > >> > > > rahuledavalath@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > > Dear All
> > > > >> > > > >
> > > > >> > > > > I am using DeltaStreamer to stream data from a Kafka topic and
> > > > >> > > > > write it into a Hudi dataset.
> > > > >> > > > > For this use case I am not doing any upserts; all operations are
> > > > >> > > > > insert-only, so each ingestion run creates a new parquet file, and a
> > > > >> > > > > large number of small files are being created. How can I merge these
> > > > >> > > > > files from the DeltaStreamer job using the available configurations?
> > > > >> > > > >
> > > > >> > > > > I think compactionSmallFileSize may be useful for this case, but I am
> > > > >> > > > > not sure whether it applies to DeltaStreamer or not. I tried it in
> > > > >> > > > > DeltaStreamer but it didn't work. Please assist on this. If possible,
> > > > >> > > > > give one example for the same.
> > > > >> > > > >
> > > > >> > > > > Thanks & Regards
> > > > >> > > > > Rahul
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > > Dear Vinoth
> > > > >> > >
> > > > >> > > For one of my use cases, I am doing only inserts. For testing I am
> > > > >> > > inserting data with only 5-10 records, and I am continuously pushing
> > > > >> > > data to the Hudi dataset. As it is insert-only, every insert creates
> > > > >> > > new small files in the dataset.
> > > > >> > >
> > > > >> > > If my insertion interval is short and I plan to keep the data for
> > > > >> > > years, this flow will create lots of small files.
> > > > >> > > I just want to know whether Hudi can merge these small files in any way.
> > > > >> > >
> > > > >> > >
> > > > >> > > Thanks & Regards
> > > > >> > > Rahul P
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >> Dear Vinoth
> > > > >>
> > > > >> I tried below configurations.
> > > > >>
> > > > >> hoodie.parquet.max.file.size=1073741824
> > > > >> hoodie.parquet.small.file.limit=943718400
> > > > >>
> > > > >> I am using the below command for inserting data from the JSON Kafka source.
> > > > >>
> > > > >> spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
> > > > >>   hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE \
> > > > >>   --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
> > > > >>   --source-ordering-field stype --target-base-path /MERGE --target-table MERGE \
> > > > >>   --props /hudi/kafka-source.properties \
> > > > >>   --schemaprovider-class com.uber.hoodie.utilities.schema.FilebasedSchemaProvider \
> > > > >>   --op insert
> > > > >>
> > > > >> But each insert job creates a new parquet file; it's not touching the
> > > > >> old parquet files.
> > > > >>
> > > > >> For reference I am sharing some of the parquet files of the Hudi dataset
> > > > >> which are generated as part of the DeltaStreamer data insertion.
> > > > >>
> > > > >> 93       /MERGE/2019/03/06/.hoodie_partition_metadata
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
> > > > >> 424.0 K  /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet
> > > > >>
> > > > >> Each job creates files of 424K and it's not merging any. Can you
> > > > >> please confirm whether Hudi can achieve the use case which I mentioned?
> > > > >> If this merging/compacting feature is there, kindly tell me what I am
> > > > >> missing here.
> > > > >>
> > > > >> Thanks & Regards
> > > > >> Rahul
> > > > >>
> > > > >>
> > > >
> > >
> > > Dear Vinoth
> > >
> > > I have also verified that the feature is kicking in.
> > > I am using the below properties, and my insert job is running with a 10s interval.
> > >
> > > hoodie.cleaner.commits.retained=6
> > > hoodie.keep.max.commits=6
> > > hoodie.keep.min.commits=3
> > > hoodie.parquet.small.file.limit=943718400
> > > hoodie.parquet.max.file.size=1073741824
> > > hoodie.compact.inline=false
> > >
> > > Now I can see about 180 files in the Hudi data set with
> > > hoodie.compact.inline=false.
> > > hadoop fs -ls   /MERGE/2019/03/14/* | wc -l
> > > 181
> > >
> > > If I set hoodie.compact.inline=true,
> > > I am getting the below error:
> > >
> > >  Loaded instants [[20190313131254__clean__COMPLETED],
> > > [20190313131254__commit__COMPLETED], [20190313131316__clean__COMPLETED],
> > > [20190313131316__commit__COMPLETED], [20190313131339__clean__COMPLETED],
> > > [20190313131339__commit__COMPLETED], [20190313131401__clean__COMPLETED],
> > > [20190313131401__commit__COMPLETED], [20190313131423__clean__COMPLETED],
> > > [20190313131423__commit__COMPLETED], [20190313131445__clean__COMPLETED],
> > > [20190313131445__commit__COMPLETED], [20190313131512__commit__COMPLETED]]
> > > Exception in thread "main"
> > > com.uber.hoodie.exception.HoodieNotSupportedException: Compaction is not
> > > supported from a CopyOnWrite table
> > >
> > >         at
> > > com.uber.hoodie.table.HoodieCopyOnWriteTable.scheduleCompaction(HoodieCopyOnWriteTable.java:168)
> > >
> > >
> > > please assist on this.
> > >
> > > Thanks & Regards
> > > Rahul
> > >
> > >
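
Given the retention settings above and the file counts reported below, one quick sanity
check is to count how many parquet versions each file group currently holds, by grouping
the listing on the file id (the part of the file name before the first underscore). A
rough shell sketch, assuming the file-naming pattern seen in the listings in this thread:

# count parquet file versions per file group in one partition
hadoop fs -ls /MERGE/2019/03/14/ | grep parquet | awk '{print $NF}' \
  | awk -F/ '{print $NF}' | awk -F_ '{print $1}' | sort | uniq -c
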
> > 
> 
> 
> 
> Dear Vinoth
> 
> In my previous mail I already mentioned that I am seeing more than 180 parquet files:
> hadoop fs -ls   /MERGE/2019/03/14/* | wc -l
> 181
> 
> I have set the commits to retain to 6 (hoodie.cleaner.commits.retained=6), so why are 181 files still there? This is the problem I am facing.
> 
> Thanks & Regards
> Rahul 
> 

Dear Vinoth

I am still facing the same issue on the COW table. I think the cleaner job is invoked at Spark-Hudi load time, but the old commits' parquet files are still there; they are not being cleaned. Can you please assist on this?

Thanks & Regards
Rahul 

Re: how to merge small parquet files in the hudi location

Posted by nishith agarwal <n3...@gmail.com>.
Rahul,

Please make sure you are also setting the following config:

"hoodie.cleaner.policy" -> This config supports 2 policies:
KEEP_LATEST_FILE_VERSIONS,
KEEP_LATEST_COMMITS (this is the default policy)


If you are cleaning based on latest file versions, please set the
policy to KEEP_LATEST_FILE_VERSIONS

-Nishith
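
For completeness, a sketch of what version-based cleaning could look like in the property
file, pairing Nishith's policy suggestion with the hoodie.cleaner.fileversions.retained
key that Rahul finds in the code later in this thread (treating the pairing of these two
keys as an assumption rather than something confirmed here):

# sketch: clean by number of file versions kept per file group, not by commits
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=6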


Re: how to merge small parquet files in the hudi location

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Rahul,

Can you paste logs related to HoodieCleaner? That could give us clues

Thanks
Vinoth
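
As a starting point for that, one way to pull cleaner-related lines out of a DeltaStreamer
run captured to a local file (the log file name here is only an illustration):

grep -iE "hoodiecleaner|clean" deltastreamer.log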


Re: how to merge small parquet files in the hudi location

Posted by ra...@gmail.com, ra...@gmail.com.

On 2019/04/04 00:41:15, Vinoth Chandar <vi...@apache.org> wrote: 
> Hi Rahul,
> 
> Sorry, not following fully. Are you saying cleaning is not triggered at all,
> or is the cleaner not reclaiming older files? This definitely should be
> working, so it's mostly some config issue.
> 
> Thanks
> Vinoth
> 
> On Wed, Apr 3, 2019 at 6:27 AM rahuledavalath@gmail.com <
> rahuledavalath@gmail.com> wrote:
> 
> >
> >
> > On 2019/03/13 12:57:59, rahuledavalath@gmail.com <ra...@gmail.com>
> > wrote:
> > >
> > >
> > > On 2019/03/13 08:42:13, Vinoth Chandar <vi...@apache.org> wrote:
> > > > Hi Rahul,
> > > >
> > > > Good to know. Yes for copy_on_write please turn off inline compaction.
> > > > (Probably explains why the default was false).
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Wed, Mar 13, 2019 at 12:51 AM rahuledavalath@gmail.com <
> > > > rahuledavalath@gmail.com> wrote:
> > > >
> > > > >
> > > > >
> > > > > On 2019/03/12 23:04:43, Vinoth Chandar <vi...@apache.org> wrote:
> > > > > > Opened up https://github.com/uber/hudi/pull/599/files to improve
> > this
> > > > > > out-of-box
> > > > > >
> > > > > > On Tue, Mar 12, 2019 at 1:27 PM Vinoth Chandar <vi...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Rahul,
> > > > > > >
> > > > > > > The files you shared all belong to same file group (they share
> > the same
> > > > > > > prefix if you notice) (
> > > > > https://hudi.apache.org/concepts.html#terminologies
> > > > > > > ).
> > > > > > > Given its not creating new file groups every run, means the
> > feature is
> > > > > > > kicking in.
> > > > > > >
> > > > > > > During each insert, Hudi will find the latest file in each file
> > group
> > > > > (I,e
> > > > > > > the one with largest instant time, timestamp) and rewrite/expand
> > that
> > > > > with
> > > > > > > the new inserts. Hudi does not clean up the old files
> > immediately,
> > > > > since
> > > > > > > that can cause running queries to fail, since they could have
> > started
> > > > > even
> > > > > > > hours ago (e.g Hive).
> > > > > > >
> > > > > > > If you want to reduce the number of files you see, you can lower
> > > > > number of
> > > > > > > commits retained
> > > > > > > https://hudi.apache.org/configurations.html#retainCommits
> > > > > > > We retain 24 by default.. i.e after the 25th file, the first one
> > will
> > > > > be
> > > > > > > automatically cleaned..
> > > > > > >
> > > > > > > Does that make sense? Are you able to query this data and find
> > the
> > > > > > > expected records?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Tue, Mar 12, 2019 at 12:23 PM rahuledavalath@gmail.com <
> > > > > > > rahuledavalath@gmail.com> wrote:
> > > > > > >
> > > > > > >>
> > > > > > >>
> > > > > > >> On 2019/03/11 18:25:46, Vinoth Chandar <vi...@apache.org>
> > wrote:
> > > > > > >> > Hi Rahul,
> > > > > > >> >
> > > > > > >> > Hudi/Copy-on-write storage would keep expanding your existing
> > > > > parquet
> > > > > > >> files
> > > > > > >> > to reach the configured file size, once you set the small
> > file size
> > > > > > >> > config..
> > > > > > >> >
> > > > > > >> > For e.g: we at uber, write 1GB files this way.. to do that,
> > you
> > > > > could
> > > > > > >> set
> > > > > > >> > something like this.
> > > > > > >> > http://hudi.apache.org/configurations.html#limitFileSize  =
> > 1 *
> > > > > 1024 *
> > > > > > >> 1024
> > > > > > >> > * 1024
> > > > > > >> >
> > http://hudi.apache.org/configurations.html#compactionSmallFileSize
> > > > > =
> > > > > > >> 900 *
> > > > > > >> > 1024 * 1024
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Please let me know if you have trouble achieving this. Also
> > please
> > > > > use
> > > > > > >> the
> > > > > > >> > insert operation (not bulk_insert) for this to work
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Thanks
> > > > > > >> > Vinoth
> > > > > > >> >
> > > > > > >> > On Mon, Mar 11, 2019 at 12:32 AM rahuledavalath@gmail.com <
> > > > > > >> > rahuledavalath@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On 2019/03/08 13:43:52, Vinoth Chandar <vi...@apache.org>
> > wrote:
> > > > > > >> > > > Hi Rahul,
> > > > > > >> > > >
> > > > > > >> > > > you can try adding
> > hoodie.parquet.small.file.limit=104857600, to
> > > > > > >> your
> > > > > > >> > > > property file to specify 100MB files. Note that this
> > works only
> > > > > if
> > > > > > >> you
> > > > > > >> > > are
> > > > > > >> > > > using insert (not bulk_insert) operation. Hudi will
> > enforce file
> > > > > > >> sizing
> > > > > > >> > > on
> > > > > > >> > > > ingest time. As of now, there is no support for
> > collapsing these
> > > > > > >> file
> > > > > > >> > > > groups (parquet + related log files) into a large file
> > group
> > > > > > >> (HIP/Design
> > > > > > >> > > > may come soon). Does that help?
> > > > > > >> > > >
> > > > > > >> > > > Also on the compaction in general, since you don't have
> > any
> > > > > updates.
> > > > > > >> > > > I think you can simply use the copy_on_write storage?
> > inserts
> > > > > will
> > > > > > >> go to
> > > > > > >> > > > the parquet file anyway on MOR..(but if you like to be
> > able to
> > > > > deal
> > > > > > >> with
> > > > > > >> > > > updates later, understand where you are going)
> > > > > > >> > > >
> > > > > > >> > > > Thanks
> > > > > > >> > > > Vinoth
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Mar 8, 2019 at 3:25 AM rahuledavalath@gmail.com <
> > > > > > >> > > > rahuledavalath@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Dear All
> > > > > > >> > > > >
> > > > > > >> > > > > I am using DeltaStreamer to stream the data from kafka
> > topic
> > > > > and
> > > > > > >> to
> > > > > > >> > > write
> > > > > > >> > > > > it into the hudi data set.
> > > > > > >> > > > > For this use case I am not doing any upsert all are
> > insert
> > > > > only
> > > > > > >> so each
> > > > > > >> > > > > job creates new parquet file after the inject job. So
> > large
> > > > > > >> number of
> > > > > > >> > > > > small files are creating. how can i  merge these files
> > from
> > > > > > >> > > deltastreamer
> > > > > > >> > > > > job using the available configurations.
> > > > > > >> > > > >
> > > > > > >> > > > > I think compactionSmallFileSize may useful for this
> > case,
> > > > > but i
> > > > > > >> am not
> > > > > > >> > > > > sure whether it is for deltastreamer or not. I tried it
> > in
> > > > > > >> > > deltastreamer
> > > > > > >> > > > > but it did't worked. Please assist on this. If possible
> > give
> > > > > one
> > > > > > >> > > example
> > > > > > >> > > > > for the same
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks & Regards
> > > > > > >> > > > > Rahul
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Dear Vinoth
> > > > > > >> > >
> > > > > > >> > > For one of my use case , I doing only inserts.For testing i
> > am
> > > > > > >> inserting
> > > > > > >> > > data which have 5-10 records only. I  am continuously
> > pushing
> > > > > data to
> > > > > > >> hudi
> > > > > > >> > > dataset. As it is insert only for every insert it's
> > creating  new
> > > > > > >> small
> > > > > > >> > > files to the dataset.
> > > > > > >> > >
> > > > > > >> > > If my insertion interval is less and i am planning for data
> > to
> > > > > keep
> > > > > > >> for
> > > > > > >> > > years, this flow will create lots of small files.
> > > > > > >> > > I just want to know whether hudi can merge these small
> > files in
> > > > > any
> > > > > > >> ways.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Thanks & Regards
> > > > > > >> > > Rahul P
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >> Dear Vinoth
> > > > > > >>
> > > > > > >> I tried below configurations.
> > > > > > >>
> > > > > > >> hoodie.parquet.max.file.size=1073741824
> > > > > > >> hoodie.parquet.small.file.limit=943718400
> > > > > > >>
> > > > > > >> I am using below code for inserting data from json kafka source.
> > > > > > >>
> > > > > > >> spark-submit --class
> > > > > > >> com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer
> > > > > > >> hoodie-utilities-0.4.5.jar --storage-type COPY_ON_WRITE
> > --source-class
> > > > > > >> com.uber.hoodie.utilities.sources.JsonKafkaSource
> > > > > --source-ordering-field
> > > > > > >> stype  --target-base-path /MERGE --target-table MERGE --props
> > > > > > >> /hudi/kafka-source.properties  --schemaprovider-class
> > > > > > >> com.uber.hoodie.utilities.schema.FilebasedSchemaProvider --op
> > insert
> > > > > > >>
> > > > > > >> But for each insert job it's creating new parquet file. It's not
> > > > > touching
> > > > > > >> old parquet files.
> > > > > > >>
> > > > > > >> For reference i am  sharing  some of the parquet files of hudi
> > dataset
> > > > > > >> which are generating as part of DeltaStreamer data insertion.
> > > > > > >>
> > > > > > >> 93  /MERGE/2019/03/06/.hoodie_partition_metadata
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002655.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002733.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002754.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002815.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002837.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002859.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002921.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312002942.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003003.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003024.parquet
> > > > > > >> 424.0 K
> > > > > > >>
> > > > >
> > /MERGE/2019/03/06/1e9735d2-2057-40c6-a4df-078eb297a298_0_20190312003045.parquet
> > > > > > >>
> > > > > > >> Each job it's creating files of 424K & it's not merging any.
> > Can you
> > > > > > >> please confirm whether hudi can achieve the use case which i
> > > > > mentioned. If
> > > > > > >> this merging/compacting  feature is there, kindly tell what i am
> > > > > missing
> > > > > > >> here.
> > > > > > >>
> > > > > > >> Thanks & Regards
> > > > > > >> Rahul
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > > > Dear Vinoth
> > > > >
> > > > > I also verified that the feature is kicking in.
> > > > > I am using the below properties, and my insert job runs at a 10-second interval.
> > > > >
> > > > > hoodie.cleaner.commits.retained=6
> > > > > hoodie.keep.max.commits=6
> > > > > hoodie.keep.min.commits=3
> > > > > hoodie.parquet.small.file.limit=943718400
> > > > > hoodie.parquet.max.file.size=1073741824
> > > > > hoodie.compact.inline=false
> > > > >
> > > > > Now I can see about 180 files in the Hudi dataset with hoodie.compact.inline=false:
> > > > > hadoop fs -ls   /MERGE/2019/03/14/* | wc -l
> > > > > 181
> > > > >
> > > > > If I set hoodie.compact.inline=true, I am getting the below error:
> > > > >
> > > > > Loaded instants [[20190313131254__clean__COMPLETED], [20190313131254__commit__COMPLETED],
> > > > > [20190313131316__clean__COMPLETED], [20190313131316__commit__COMPLETED],
> > > > > [20190313131339__clean__COMPLETED], [20190313131339__commit__COMPLETED],
> > > > > [20190313131401__clean__COMPLETED], [20190313131401__commit__COMPLETED],
> > > > > [20190313131423__clean__COMPLETED], [20190313131423__commit__COMPLETED],
> > > > > [20190313131445__clean__COMPLETED], [20190313131445__commit__COMPLETED],
> > > > > [20190313131512__commit__COMPLETED]]
> > > > > Exception in thread "main" com.uber.hoodie.exception.HoodieNotSupportedException: Compaction is not supported from a CopyOnWrite table
> > > > >         at com.uber.hoodie.table.HoodieCopyOnWriteTable.scheduleCompaction(HoodieCopyOnWriteTable.java:168)
> > > > >
> > > > >
> > > > > Please assist with this.
> > > > >
> > > > > Thanks & Regards
> > > > > Rahul
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > Dear Vinod
> > >
> > > In my previous mail I already mentioned that I am seeing more than 180 parquet files.
> > > hadoop fs -ls   /MERGE/2019/03/14/* | wc -l
> > > 181
> > >
> > > I set commits to retain to only 6 (hoodie.cleaner.commits.retained=6), so why are 181 files appearing? This is the point where I am facing a problem.
> > >
> > > Thanks & Regards
> > > Rahul
> > >
> >
> > Dear Vinod
> >
> > I am still facing the same issue on the COW table. I think the clean job is invoked during the
> > Spark/Hudi write, but the parquet files of the old commits are still there; they are not being
> > cleaned. Can you please assist with this?
> >
> > Thanks & Regards
> > Rahul
> >
> 

Dear Vinod

As per the configuration document, I am using hoodie.cleaner.commits.retained=6 for my test case,
but I can see more than 6 parquet files for the same file group.

I also think it is a config issue, so please tell me if any other configuration affects this. I am using only a small amount of data to check this case.

While checking the code I found String CLEANER_FILE_VERSIONS_RETAINED_PROP =
      "hoodie.cleaner.fileversions" + ".retained";
which is not mentioned in the config document. So I also tried hoodie.cleaner.fileversions.retained=6, but I see the same issue.
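
If it helps to narrow this down: as far as I can tell, hoodie.cleaner.fileversions.retained only takes effect when the cleaner policy is set to keep latest file versions, while the default policy works off hoodie.cleaner.commits.retained. A minimal sketch of the two modes follows; the hoodie.cleaner.policy key name and its values are assumptions to verify against the configuration code/docs for this Hudi version.

# Sketch only -- verify the policy key and values for your Hudi version.
# Assumed default policy: keep the file slices needed by the last N commits.
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=6

# Assumed alternative policy: keep at most N versions of each file group.
# hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
# hoodie.cleaner.fileversions.retained=6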

Sample files in my Hudi location:

434616 2019-04-03 19:32 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193219.parquet
434722 2019-04-03 19:32 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193240.parquet
434765 2019-04-03 19:33 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193258.parquet
434855 2019-04-03 19:33 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193319.parquet
434903 2019-04-03 19:33 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193341.parquet
434954 2019-04-03 19:34 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193400.parquet
434992 2019-04-03 19:35 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193545.parquet
435030 2019-04-03 19:36 /loadhudi54/2019/03/16/a790b66f-4684-4600-8e8e-a8b97cf53eed_0_20190403193640.parquet
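
A quick way to see how many versions of this file group actually survive is to count the files that share its file id, e.g.:

# Count the parquet versions for the single file group listed above.
hadoop fs -ls /loadhudi54/2019/03/16 | grep a790b66f-4684-4600-8e8e-a8b97cf53eed | wc -l

With hoodie.cleaner.commits.retained=6 and a cleaner that is really running, this count should level off at roughly the retained number instead of growing by one on every insert.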


Please assist with this. As it is a basic feature, I am not able to go further.

Thanks & Regards
Rahul 

Re: how to merge small parqut files in the hudi location

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Rahul,

Sorry, I'm not following fully. Are you saying cleaning is not triggered at all,
or is the cleaner not reclaiming older files? This definitely should be
working, so it's most likely some config issue.
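
A minimal check to tell those two cases apart, assuming completed clean instants are written as <instant>.clean files under the .hoodie folder of the base path (worth verifying for your version):

# No output at all            -> cleaning is never triggered
# Output, but old files stay  -> the cleaner runs but retains them per the retention config
hadoop fs -ls /loadhudi54/.hoodie | grep '\.clean'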

Thanks
Vinoth

On Wed, Apr 3, 2019 at 6:27 AM rahuledavalath@gmail.com <
rahuledavalath@gmail.com> wrote:
