Posted to dev@hudi.apache.org by Vinoth Chandar <vi...@apache.org> on 2021/10/14 20:23:33 UTC

Re: Difference/compatibility between original Parquet files and Hudi modified Parquet files

Sorry, dropped the ball on this.

If you do the following, your queries will be correct and not see any
duplicates/partial data.

- For Spark, you now need to do spark.read.format("hudi").load() (see the
sketch just after this list).
- For Presto/Trino, once you sync the table metadata out to the Hive
metastore, Presto/Trino understand Hudi tables natively and can filter for
the correct snapshot.
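
For example, a snapshot read in PySpark would look roughly like this (the
S3 path is just a placeholder, and the Hudi Spark bundle needs to be on the
Spark classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-snapshot-read").getOrCreate()

    # format("hudi") resolves only the latest committed file slices, so the
    # query never sees older file versions or partially written data.
    df = spark.read.format("hudi").load("s3://bucket2/hudi_table/")
    df.show()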

In general, while copying data, you need to copy both the data files and
the .hoodie folder.
You can also explore the incremental query in Hudi to simplify this
process.
The DeltaStreamer tool can do this for you, i.e. read incrementally from
table 1 in bucket 1, optionally transform the data, and transactionally
write it to table 2 in bucket 2.
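
A rough PySpark sketch of that flow, assuming a record key field "id" and a
precombine field "ts" (both placeholders for whatever your schema uses) and
option names as of recent Hudi releases:

    # Pull only the records committed to table 1 after the given instant time.
    incr_df = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20210927000000")
               .load("s3://bucket1/hudi_table_1/"))

    # Drop Hudi meta columns (and apply any transforms) before writing onwards.
    clean_df = incr_df.drop("_hoodie_commit_time", "_hoodie_commit_seqno",
                            "_hoodie_record_key", "_hoodie_partition_path",
                            "_hoodie_file_name")

    # Transactionally upsert the batch into table 2 in bucket 2.
    (clean_df.write.format("hudi")
        .option("hoodie.table.name", "hudi_table_2")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("s3://bucket2/hudi_table_2/"))

The DeltaStreamer (HoodieDeltaStreamer with an incremental Hudi source)
automates this loop for you, including checkpointing the last consumed
instant time.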

Hope that helps. Otherwise, your plan seems good to go!

On Mon, Sep 27, 2021 at 2:55 PM Xiong Qiang <wi...@gmail.com>
wrote:

> Hi Vinoth,
>
> Thank you very much for the detailed explanation. That is very helpful.
>
> For the downstream applications, we have Spark applications, and
> Presto/Trino.
> For Spark, we use spark.read.format('parquet').load() to read the Parquet
> files for other processing.
> For Presto/Trino, we use AWS Lambda to add or drop the partitions in a
> daily schedule.
>
> Our use case of Hudi is mainly the Copy On Write mode. We load Parquet
> files from S3 bucket1 into Hudi, do selection and deletion, and then write
> to S3 bucket2.
> Our downstream applications read from either S3 bucket1 or S3 bucket2, but
> not both. *My understanding is that this use case will not create a
> duplication issue. Is that correct?*
>
> In the future, we may consider combining S3 bucket2 into S3 bucket1. We
> plan to first delete the old/original Parquet files in bucket1, drop the
> partitions as necessary in Presto/Trino, and then copy the modified Parquet
> files (i.e. the output Parquet from the Hudi dataset) from bucket2 to
> bucket1. *Will that result in duplicates (if we are on Copy On Write mode)?*
>
> Besides the potential duplicates, are there any other pitfalls that I need
> to pay special attention to?
>
> Thanks a lot!
> Xiong
>
>
>
> On Wed, Sep 22, 2021 at 2:29 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi,
> >
> > There is no format difference whatsoever. Hudi just adds additional
> > footers to the Parquet files for min/max key values and bloom filters,
> > plus some meta fields for tracking keys and commit times for incremental
> > queries.
> > Any standard parquet reader can read the parquet files in a Hudi table.
> > These downstream applications, are they Spark jobs? What do you use to
> > consume the parquet files?
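> >
> > To make the "no format difference" point concrete, here is a rough
> > PySpark illustration (the S3 path is made up):
> >
> >     # A plain parquet reader works fine on Hudi's files; it simply also sees
> >     # the meta columns Hudi adds: _hoodie_commit_time, _hoodie_commit_seqno,
> >     # _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name.
> >     # (For correct results you still need to read only the latest committed
> >     # files, see the next point.)
> >     df = spark.read.format("parquet").load("s3://bucket2/hudi_table/2021-09-01/")
> >     df.printSchema()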
> >
> > The main thing your downstream reader needs to do is to read a correct
> > snapshot, i.e. only the latest committed files. Otherwise, you may end up
> > with duplicate values.
> > For example, when you issue the Hudi delete, Hudi will internally create a
> > new version of the parquet files, without the deleted rows. So if you are
> > not careful about filtering for the latest files, you may end up reading
> > both versions and get duplicates.
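> >
> > For instance, right after a delete commit the same file group can have two
> > versions side by side (the file names below are only indicative):
> >
> >     # s3://bucket2/hudi_table/2021-09-01/abc-0_0-1-1_20210920000000.parquet  <- old
> >     # s3://bucket2/hudi_table/2021-09-01/abc-0_0-2-2_20210927000000.parquet  <- new, rows deleted
> >     # A plain parquet read of the folder picks up both versions and double counts:
> >     df_naive = spark.read.format("parquet").load("s3://bucket2/hudi_table/2021-09-01/")
> >     # Reading through Hudi resolves only the latest committed file slice:
> >     df_clean = spark.read.format("hudi").load("s3://bucket2/hudi_table/")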
> >
> > All of this happens automatically, if you are using a supported engine
> > like spark, flink, hive, presto, trino, ...
> >
> > yes. hudi (copy on write) dataset is a set of parquet files, with some
> > metadata.
> >
> > Hope that helps
> >
> > Thanks
> > Vinoth
> >
> > On Fri, Sep 17, 2021 at 9:09 PM Xiong Qiang <wi...@gmail.com>
> > wrote:
> >
> > > Hi, all,
> > >
> > > I am new to Hudi, so please forgive me for naive questions.
> > >
> > > I was following the guides at
> > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > and at https://hudi.incubator.apache.org/docs/quick-start-guide/.
> > >
> > > My goal is to load original Parquet files (written by a Spark application
> > > from Kafka to S3) into Hudi, delete some rows, and then save the modified
> > > Parquet files back to (a different path in) S3. There are other
> > > downstream applications that consume the original Parquet files for
> > > further processing.
> > >
> > > My question: *Is there any format difference between the original Parquet
> > > files and the Hudi modified Parquet files?* Are the Hudi modified Parquet
> > > files compatible with the original Parquet files? In other words, will
> > > other downstream applications (previously consuming the original Parquet
> > > files) be able to consume the modified Parquet files (i.e. the Hudi
> > > dataset) without any code change?
> > >
> > > In the docs, I have seen the phrase "Hudi dataset", which, in my
> > > understanding, is simply a Parquet file with accompanying Hudi metadata. I
> > > have also read the migration doc (
> > > https://hudi.incubator.apache.org/docs/migration_guide/). My understanding
> > > is that we can migrate from original Parquet files to a Hudi dataset (Hudi
> > > modified Parquet files). *Can we use (or migrate) a Hudi dataset (Hudi
> > > modified Parquet files) back to original Parquet files to be consumed by
> > > other downstream applications?*
> > >
> > > Thank you very much!
> > >
> >
>