Posted to dev@hudi.apache.org by OpenInx <op...@gmail.com> on 2020/01/07 06:58:13 UTC

Any plan to support hudi running on S3 ?

Hi

As far as I know, Hudi can only run on HDFS, while S3 is much cheaper
than HDFS storage. So I'd like to ask:
is there any plan to support Hudi on S3, for example by making the storage
layer pluggable, with HDFS/S3/OSS implementations?
By the way, does Hudi have a strong dependency on atomic rename semantics?
If so, moving it to S3 could be a problem.

Thanks.

Re: Any plan to support hudi running on S3 ?

Posted by Pratyaksh Sharma <pr...@gmail.com>.

Hi OpenInx,

+1 to what Syed has mentioned. Additionally, the meta files are created and
maintained in the same file system in which you are creating your Hudi
dataset. So there is no hard dependency of meta files on HDFS.
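
To illustrate the point above, here is a minimal sketch, assuming hypothetical
base paths (the bucket, host and table names are made up). It only shows that
the .hoodie folder lives under the dataset's own base path, so it shares that
path's filesystem scheme. It is not Hudi code.

```python
from urllib.parse import urlparse


def hoodie_meta_path(base_path):
    """Return the metadata folder for a dataset: <base_path>/.hoodie."""
    return base_path.rstrip("/") + "/.hoodie"


def storage_scheme(path):
    """Return the URI scheme (e.g. 's3a' or 'hdfs') that selects the filesystem."""
    return urlparse(path).scheme


if __name__ == "__main__":
    # Hypothetical base paths, for illustration only.
    for base in ("s3a://my-bucket/hudi/trips", "hdfs://namenode:8020/data/trips"):
        meta = hoodie_meta_path(base)
        print(meta, "->", storage_scheme(meta))
```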

Also, to answer your question: every commit meta file in the .hoodie folder
is a small file, and small files do not go well with a DFS. To take care of
this, Hudi archives older commits into a sequential log. You can go through
the following links to get a better understanding.

1. https://hudi.apache.org/configurations.html#compaction-configs
2. https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles
3. https://hudi.apache.org/use_cases.html#near-real-time-ingestion
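
The archival idea described above can be sketched as a toy model. This is not
Hudi's implementation: the function name and the thresholds used here are
made up for illustration. Hudi exposes settings such as
hoodie.keep.min.commits and hoodie.keep.max.commits for this purpose; see the
configuration page linked above for their exact semantics.

```python
def archive_commits(active, archived, min_commits, max_commits):
    """Toy model of timeline archival.

    `active` and `archived` are lists of commit names, oldest first.
    Once the active timeline holds more than `max_commits` entries, the
    oldest ones are moved to the archive until `min_commits` remain.
    Returns the new (active, archived) pair; inputs are not modified.
    """
    if len(active) <= max_commits:
        return list(active), list(archived)
    cut = len(active) - min_commits
    return list(active[cut:]), list(archived) + list(active[:cut])


if __name__ == "__main__":
    # Illustrative thresholds only, chosen to keep the example small.
    timeline = ["commit-%02d" % i for i in range(1, 8)]
    active, archived = archive_commits(timeline, [], min_commits=3, max_commits=5)
    print("active:  ", active)
    print("archived:", archived)
```

In this model the many small per-commit files are bounded by the maximum
threshold, while older history stays available in one sequential archive.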

On Tue, Jan 7, 2020 at 4:45 PM Syed Abdul Kather <in...@gmail.com> wrote:

> On Tue, Jan 7, 2020, 14:20 OpenInx <op...@gmail.com> wrote:
>
> > Well, I understand your point. You mean all data files and delta logs can
> > be stored on S3. Makes sense.
> >
> > But it seems the Hudi meta files still depend on HDFS, and almost all of
> > them are small files, so how do we limit the small file count (which may
> > put pressure on the HDFS namenode)?
> >
> We use Hudi heavily in production (Spark + S3). HDFS is just one filesystem
> among others, such as S3. If Spark supports a filesystem, Hudi will work on
> it too. If you run into issues, please write back to the community. Hudi
> doesn't have any hard dependency on HDFS. Hope this helps.
>
> >
> > Thanks.
> >

Re: Any plan to support hudi running on S3 ?

Posted by Syed Abdul Kather <in...@gmail.com>.

On Tue, Jan 7, 2020, 14:20 OpenInx <op...@gmail.com> wrote:

> Well, I understand your point. You mean all data files and delta logs can be
> stored on S3. Makes sense.
>
> But it seems the Hudi meta files still depend on HDFS, and almost all of
> them are small files, so how do we limit the small file count (which may
> put pressure on the HDFS namenode)?
>
We use Hudi heavily in production (Spark + S3). HDFS is just one filesystem
among others, such as S3. If Spark supports a filesystem, Hudi will work on
it too. If you run into issues, please write back to the community. Hudi
doesn't have any hard dependency on HDFS. Hope this helps.

>
> Thanks.
>

Re: Any plan to support hudi running on S3 ?

Posted by OpenInx <op...@gmail.com>.

Well, I understand your point. You mean all data files and delta logs can be
stored on S3. Makes sense.

But it seems the Hudi meta files still depend on HDFS, and almost all of
them are small files, so how do we limit the small file count (which may
put pressure on the HDFS namenode)?

Thanks.

On Tue, Jan 7, 2020 at 4:25 PM Syed Abdul Kather <in...@gmail.com> wrote:

> Hi,
> Hudi runs on Spark. The storage layer doesn't matter. If you configure your
> Spark to use the S3 filesystem, that is enough.
>             Thanks and Regards,
>         S SYED ABDUL KATHER

Re: Any plan to support hudi running on S3 ?

Posted by Syed Abdul Kather <in...@gmail.com>.

Hi,
Hudi runs on Spark. The storage layer doesn't matter. If you configure your
Spark to use the S3 filesystem, that is enough.
            Thanks and Regards,
        S SYED ABDUL KATHER
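
As a rough illustration of "configure Spark to use the S3 filesystem", the
sketch below builds the Spark properties that route s3a:// paths to Hadoop's
S3A connector and formats them as spark-submit arguments. The helper names,
the bucket and the table path are hypothetical. It assumes the hadoop-aws
module is on the classpath and leaves credentials to whatever provider the
environment already uses. Treat it as a starting point, not a verified
production setup.

```python
def s3a_spark_conf(endpoint=None):
    """Return Spark properties that make s3a:// paths usable from Spark.

    `endpoint` is optional and only needed for non-default S3 endpoints.
    """
    conf = {
        # Kryo is the serializer the Hudi quick start uses with Spark.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Map the s3a:// scheme to Hadoop's S3A filesystem implementation.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    if endpoint:
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
    return conf


def spark_submit_args(conf):
    """Flatten a property dict into ['--conf', 'key=value', ...], sorted by key."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "%s=%s" % (key, conf[key])]
    return args


# Hypothetical table location; the Hudi base path simply uses the s3a scheme.
BASE_PATH = "s3a://my-bucket/hudi/trips"

if __name__ == "__main__":
    print("spark-submit " + " ".join(spark_submit_args(s3a_spark_conf())))
    print("table base path:", BASE_PATH)
```

With these properties in place, the only Hudi-side change is pointing the
table base path at an s3a:// location instead of an hdfs:// one.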



On Tue, Jan 7, 2020 at 12:28 PM OpenInx <op...@gmail.com> wrote:

> Hi
>
> As far as I know, Hudi can only run on HDFS, while S3 is much cheaper
> than HDFS storage. So I'd like to ask:
> is there any plan to support Hudi on S3, for example by making the storage
> layer pluggable, with HDFS/S3/OSS implementations?
> By the way, does Hudi have a strong dependency on atomic rename semantics?
> If so, moving it to S3 could be a problem.
>
> Thanks.
>