Posted to dev@hudi.apache.org by kaka chen <ka...@gmail.com> on 2019/02/27 02:52:08 UTC

Does Insert generate at least one file per Spark or Spark Streaming batch?

Hi All,

I found that Insert generates at least one file for each Spark or
Spark Streaming batch.
Is this the expected behavior? If so, how can these small files be
controlled? Does Hudi provide a tool to compact them?

Thanks,
Frank

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by kaka chen <ka...@gmail.com>.
Nishith,

Thanks, will try it.

Thanks,
Frank

On Tue, Mar 12, 2019 at 11:21 AM nishith agarwal <n3...@gmail.com> wrote:

> Frank,
>
> You can play with a couple of configs to keep X number of older file
> versions. Take a look at these configs:
> https://hudi.apache.org/configurations.html#withCompactionConfig.
> Specifically, you can choose the number of commits you want to keep; here,
> commits = versions.
> Depending on how long your queries run, you might want to keep the older
> data files for a configured amount of time, after which they will be cleaned.
>
> Thanks,
> Nishith
>
> On Mon, Mar 11, 2019 at 7:42 PM kaka chen <ka...@gmail.com> wrote:
>
> > Hi Vinoth,
> >
> > Trying this feature, I find that each insert rewrites the small file into
> > a new file that also contains the previously inserted records.
> > But how do I clean up the old files when using COW tables?
> >
> > Thanks,
> > Frank
> >
> > On Thu, Feb 28, 2019 at 3:24 AM Vinoth Chandar <vi...@apache.org> wrote:
> >
> > > Similarly, please try the 0.4.5 release. It has small file handling
> > > turned on by default.
> > >
> > > Also, please use the insert API/operation (not bulk_insert) if you want
> > > this behavior.
> > >
> > > Let us know if you still run into issues.
> > >
> > > On Tue, Feb 26, 2019 at 11:09 PM kaka chen <ka...@gmail.com> wrote:
> > >
> > > > Thanks!
> > > >
> > > > On Wed, Feb 27, 2019 at 2:56 PM nishith agarwal <n3...@gmail.com> wrote:
> > > >
> > > > > Hi Kaka,
> > > > >
> > > > > Hudi automatically does file sizing for you. As you ingest more inserts,
> > > > > the existing file will be automatically sized up. You can play with a few
> > > > > configs:
> > > > >
> > > > > https://hudi.apache.org/configurations.html#withStorageConfig -> This
> > > > > config allows you to set a max size for your output file.
> > > > > https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> > > > > This config allows you to set a minimum file size below which files
> > > > > will be automatically sized up.
> > > > >
> > > > > As you can guess, limitFileSize >= compactionSmallFileSize.
> > > > > Hope this helps.
> > > > >
> > > > > Thanks,
> > > > > Nishith
> > > > >
> > > > > On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I found that Insert generates at least one file for each Spark or
> > > > > > Spark Streaming batch.
> > > > > > Is this the expected behavior? If so, how can these small files be
> > > > > > controlled? Does Hudi provide a tool to compact them?
> > > > > >
> > > > > > Thanks,
> > > > > > Frank
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by nishith agarwal <n3...@gmail.com>.
Frank,

You can play with a couple of configs to keep X number of older file
versions. Take a look at these configs:
https://hudi.apache.org/configurations.html#withCompactionConfig.
Specifically, you can choose the number of commits you want to keep; here,
commits = versions.
Depending on how long your queries run, you might want to keep the older
data files for a configured amount of time, after which they will be cleaned.

Thanks,
Nishith
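[Editor's note] To make the suggestion above concrete, the cleaner settings can be expressed as Hudi write-client properties. This is only a sketch: the keys follow the linked configuration page from that era, and the retention values are illustrative, not recommendations.

```properties
# Clean up old file versions once they fall out of the last N commits.
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=24

# Alternative policy: keep a fixed number of versions per file instead.
# hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
# hoodie.cleaner.fileversions.retained=3
```

With KEEP_LATEST_COMMITS, size commits.retained so it covers your longest-running query: a query that started before a clean may still be reading the older file slices.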

On Mon, Mar 11, 2019 at 7:42 PM kaka chen <ka...@gmail.com> wrote:

> Hi Vinoth,
>
> Trying this feature, I find that each insert rewrites the small file into
> a new file that also contains the previously inserted records.
> But how do I clean up the old files when using COW tables?
>
> Thanks,
> Frank
>
> On Thu, Feb 28, 2019 at 3:24 AM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Similarly, please try the 0.4.5 release. It has small file handling
> > turned on by default.
> >
> > Also, please use the insert API/operation (not bulk_insert) if you want
> > this behavior.
> >
> > Let us know if you still run into issues.
> >
> > On Tue, Feb 26, 2019 at 11:09 PM kaka chen <ka...@gmail.com> wrote:
> >
> > > Thanks!
> > >
> > > On Wed, Feb 27, 2019 at 2:56 PM nishith agarwal <n3...@gmail.com> wrote:
> > >
> > > > Hi Kaka,
> > > >
> > > > Hudi automatically does file sizing for you. As you ingest more inserts,
> > > > the existing file will be automatically sized up. You can play with a few
> > > > configs:
> > > >
> > > > https://hudi.apache.org/configurations.html#withStorageConfig -> This
> > > > config allows you to set a max size for your output file.
> > > > https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> > > > This config allows you to set a minimum file size below which files
> > > > will be automatically sized up.
> > > >
> > > > As you can guess, limitFileSize >= compactionSmallFileSize.
> > > > Hope this helps.
> > > >
> > > > Thanks,
> > > > Nishith
> > > >
> > > > On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I found that Insert generates at least one file for each Spark or
> > > > > Spark Streaming batch.
> > > > > Is this the expected behavior? If so, how can these small files be
> > > > > controlled? Does Hudi provide a tool to compact them?
> > > > >
> > > > > Thanks,
> > > > > Frank
> > > > >
> > > >
> > >
> >
>

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by kaka chen <ka...@gmail.com>.
Hi Vinoth,

Trying this feature, I find that each insert rewrites the small file into
a new file that also contains the previously inserted records.
But how do I clean up the old files when using COW tables?

Thanks,
Frank

On Thu, Feb 28, 2019 at 3:24 AM Vinoth Chandar <vi...@apache.org> wrote:

> Similarly, please try the 0.4.5 release. It has small file handling
> turned on by default.
>
> Also, please use the insert API/operation (not bulk_insert) if you want
> this behavior.
>
> Let us know if you still run into issues.
>
> On Tue, Feb 26, 2019 at 11:09 PM kaka chen <ka...@gmail.com> wrote:
>
> > Thanks!
> >
> > On Wed, Feb 27, 2019 at 2:56 PM nishith agarwal <n3...@gmail.com> wrote:
> >
> > > Hi Kaka,
> > >
> > > Hudi automatically does file sizing for you. As you ingest more inserts,
> > > the existing file will be automatically sized up. You can play with a few
> > > configs:
> > >
> > > https://hudi.apache.org/configurations.html#withStorageConfig -> This
> > > config allows you to set a max size for your output file.
> > > https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> > > This config allows you to set a minimum file size below which files
> > > will be automatically sized up.
> > >
> > > As you can guess, limitFileSize >= compactionSmallFileSize.
> > > Hope this helps.
> > >
> > > Thanks,
> > > Nishith
> > >
> > > On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I found that Insert generates at least one file for each Spark or
> > > > Spark Streaming batch.
> > > > Is this the expected behavior? If so, how can these small files be
> > > > controlled? Does Hudi provide a tool to compact them?
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > >
> >
>

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by Vinoth Chandar <vi...@apache.org>.
Similarly, please try the 0.4.5 release. It has small file handling
turned on by default.

Also, please use the insert API/operation (not bulk_insert) if you want
this behavior.

Let us know if you still run into issues.
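[Editor's note] As a sketch of the distinction, with the Spark datasource the write operation is selected via an option. The key below is assumed from the Hudi datasource options of that era; check the configuration page for your release.

```properties
# "insert" routes new records into existing small files (small file handling);
# "bulk_insert" skips that sizing pass and simply writes new files each batch.
hoodie.datasource.write.operation=insert
```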

On Tue, Feb 26, 2019 at 11:09 PM kaka chen <ka...@gmail.com> wrote:

> Thanks!
>
> On Wed, Feb 27, 2019 at 2:56 PM nishith agarwal <n3...@gmail.com> wrote:
>
> > Hi Kaka,
> >
> > Hudi automatically does file sizing for you. As you ingest more inserts,
> > the existing file will be automatically sized up. You can play with a few
> > configs:
> >
> > https://hudi.apache.org/configurations.html#withStorageConfig -> This
> > config allows you to set a max size for your output file.
> > https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> > This config allows you to set a minimum file size below which files
> > will be automatically sized up.
> >
> > As you can guess, limitFileSize >= compactionSmallFileSize.
> > Hope this helps.
> >
> > Thanks,
> > Nishith
> >
> > On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I found that Insert generates at least one file for each Spark or
> > > Spark Streaming batch.
> > > Is this the expected behavior? If so, how can these small files be
> > > controlled? Does Hudi provide a tool to compact them?
> > >
> > > Thanks,
> > > Frank
> > >
> >
>

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by kaka chen <ka...@gmail.com>.
Thanks!

On Wed, Feb 27, 2019 at 2:56 PM nishith agarwal <n3...@gmail.com> wrote:

> Hi Kaka,
>
> Hudi automatically does file sizing for you. As you ingest more inserts,
> the existing file will be automatically sized up. You can play with a few
> configs:
>
> https://hudi.apache.org/configurations.html#withStorageConfig -> This
> config allows you to set a max size for your output file.
> https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
> This config allows you to set a minimum file size below which files
> will be automatically sized up.
>
> As you can guess, limitFileSize >= compactionSmallFileSize.
> Hope this helps.
>
> Thanks,
> Nishith
>
> On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:
>
> > Hi All,
> >
> > I found that Insert generates at least one file for each Spark or
> > Spark Streaming batch.
> > Is this the expected behavior? If so, how can these small files be
> > controlled? Does Hudi provide a tool to compact them?
> >
> > Thanks,
> > Frank
> >
>

Re: Does Insert generate at least one file per Spark or Spark Streaming batch?

Posted by nishith agarwal <n3...@gmail.com>.
Hi Kaka,

Hudi automatically does file sizing for you. As you ingest more inserts,
the existing file will be automatically sized up. You can play with a few
configs:

https://hudi.apache.org/configurations.html#withStorageConfig -> This
config allows you to set a max size for your output file.
https://hudi.apache.org/configurations.html#compactionSmallFileSize ->
This config allows you to set a minimum file size below which files
will be automatically sized up.

As you can guess, limitFileSize >= compactionSmallFileSize.
Hope this helps.

Thanks,
Nishith
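[Editor's note] The two knobs above map to write-client properties like the following. The keys are taken from the linked configuration page for that era; the byte values are only illustrative.

```properties
# limitFileSize: target max size for each data file written (~120 MB here).
hoodie.parquet.max.file.size=125829120

# compactionSmallFileSize: files below this size (~100 MB here) count as
# "small" and receive new inserts until they grow toward the max size.
hoodie.parquet.small.file.limit=104857600
```

As stated above, keep max.file.size >= small.file.limit.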

On Tue, Feb 26, 2019 at 6:52 PM kaka chen <ka...@gmail.com> wrote:

> Hi All,
>
> I found that Insert generates at least one file for each Spark or
> Spark Streaming batch.
> Is this the expected behavior? If so, how can these small files be
> controlled? Does Hudi provide a tool to compact them?
>
> Thanks,
> Frank
>