Posted to dev@hudi.apache.org by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID> on 2020/02/12 11:16:27 UTC

Apache Hudi on AWS EMR

Hi Team,

I want to set up an incremental view of my AWS S3 parquet data through
Apache Hudi, and I want to query this data through Athena, but Athena does
not currently support Hudi datasets.

So there are a few questions I want to understand here:

1 - How to stream S3 parquet files to a Hudi dataset running on EMR.

2 - How to query a Hudi dataset running on EMR.

Please help me to understand this.

Thanks

Raghvendra
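
For context on question 1, the approach the replies below converge on is
rewriting the S3 parquet data as a Hudi dataset with the Hudi Spark
datasource on EMR. A minimal spark-shell sketch; the paths and the columns
`id`, `ts`, `dt` are placeholders, not from this thread:

    // spark-shell --jars /usr/lib/hudi/hudi-spark-bundle.jar  (jar location on EMR may vary)
    import org.apache.spark.sql.SaveMode

    val df = spark.read.parquet("s3://my-bucket/raw-parquet/")          // hypothetical source

    df.write.format("org.apache.hudi").
      option("hoodie.table.name", "my_hudi_table").
      option("hoodie.datasource.write.operation", "bulk_insert").       // first load; "upsert" for later runs
      option("hoodie.datasource.write.recordkey.field", "id").          // assumed record key column
      option("hoodie.datasource.write.precombine.field", "ts").         // assumed ordering column for de-duping
      option("hoodie.datasource.write.partitionpath.field", "dt").      // assumed partition column
      mode(SaveMode.Overwrite).
      save("s3://my-bucket/hudi/my_hudi_table")                         // hypothetical target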

Re: Apache Hudi on AWS EMR

Posted by Vinoth Chandar <vi...@apache.org>.
Filed https://issues.apache.org/jira/browse/HUDI-648 to track error
tables.

Please ping on the ticket if anyone is interested in picking it up.

On Fri, Feb 28, 2020 at 4:58 AM Raghvendra Dhar Dubey
<ra...@delhivery.com.invalid> wrote:

> Hi Udit,
>
> I tried Hudi version 0.5.1 and it worked fine; this issue appeared with
> Hudi 0.5.0. The other EMR-related issues have been discussed with Rahul.
> Thanks to all of you for the cooperation.
>
> Thanks
> Raghvendra
>
> On Fri, Feb 28, 2020 at 5:34 AM Mehrotra, Udit <ud...@amazon.com> wrote:
>
> > Raghvendra,
> >
> > Can you enable TRACE level logging for Hudi on EMR, and provide the
> > error logs? For this, go to /etc/spark/conf/log4j.properties and change
> > the logging level of log4j.logger.org.apache.hudi to TRACE. This would
> > help provide the failed records/keys, based off
> > https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
> >
> > Another thing that would help is the Avro schema that gets printed on
> > the driver when you run your job. We need to understand which field is
> > treated as INT96, and why, because the current parquet-avro does not
> > handle its conversion. Also, any other questions about EMR we can
> > discuss in the meeting you have set up with Rahul from the EMR team.
> >
> > Thanks,
> > Udit
> >
> > On 2/27/20, 11:00 AM, "Shiyan Xu" <xu...@gmail.com> wrote:
> >
> > > +1 on the idea. Giving a config like `--error-path`, where all failed
> > > conversions are saved, provides flexibility for later processing.
> > > SQS/SNS can pick that up later.
> > >
> > > On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar <vi...@apache.org> wrote:
> > >
> > > > On the second part, it seems like a question for the EMR folks?
> > > >
> > > > Hudi's RDD-level APIs do hand the failed records back, and maybe we
> > > > should consider writing out the error records somewhere for the
> > > > datasource as well? Others, any thoughts?
> > > >
> > > > On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
> > > > <ra...@delhivery.com.invalid> wrote:
> > > >
> > > > > Thanks Gary and Udit,
> > > > >
> > > > > I tried HoodieDeltaStreamer for reading parquet files from S3, but
> > > > > there is an issue where AvroSchemaConverter is not able to convert
> > > > > Parquet INT96. So I thought to use Spark Structured Streaming to
> > > > > read data from S3 and write into Hudi. But since Databricks
> > > > > provides "cloudFiles" for failure handling, is there something
> > > > > similar in EMR, or do we need to handle this failure manually by
> > > > > introducing SQS and SNS?
> > > > >
> > > > > On 2020/02/18 20:03:16, "Mehrotra, Udit" <uditme@amazon.com.INVALID> wrote:
> > > > >
> > > > > > The workaround provided by Gary can help query Hudi tables
> > > > > > through Athena for Copy on Write tables, by basically querying
> > > > > > only the latest commit files as standard parquet. It would
> > > > > > definitely be worth documenting, as several people have asked
> > > > > > for it, and I remember providing the same suggestion on Slack
> > > > > > earlier. I can add it if I have the perms.
> > > > > >
> > > > > > >> if I connect to the Hive catalog on EMR, which is able to
> > > > > > >> provide the Hudi views correctly, I should be able to get
> > > > > > >> correct results on Athena
> > > > > >
> > > > > > As Vinoth mentioned, just connecting to the metastore is not
> > > > > > enough. Athena would still use its own Presto, which does not
> > > > > > support Hudi.
> > > > > >
> > > > > > As for Hudi support in Athena: Athena does use Presto, but it's
> > > > > > their own custom version, and I don't think they yet have the
> > > > > > code that the Hudi folks contributed to Presto, i.e. the split
> > > > > > annotations etc. Also, they don't have the Hudi jars in the
> > > > > > Presto classpath. We are not sure of any timelines for this
> > > > > > support, but I have heard that work should start soon.
> > > > > >
> > > > > > Thanks,
> > > > > > Udit
> > > > > >
> > > > > > On 2/18/20, 11:27 AM, "Vinoth Chandar" <vi...@apache.org> wrote:
> > > > > >
> > > > > > > Thanks everyone for chiming in, especially Gary for the
> > > > > > > detailed workaround (should we FAQ this workaround? food for
> > > > > > > thought).
> > > > > > >
> > > > > > > >> if I connect to the Hive catalog on EMR, which is able to
> > > > > > > >> provide the Hudi views correctly, I should be able to get
> > > > > > > >> correct results on Athena
> > > > > > >
> > > > > > > Knowing how the Presto/Hudi integration works, simply being
> > > > > > > able to read from the Hive metastore is not enough. Presto
> > > > > > > has code to specially recognize Hudi tables and does an
> > > > > > > additional filtering step, which lets it query the data in
> > > > > > > there correctly. (Gary's workaround above keeps just one
> > > > > > > version around for a given file group.)
> > > > > > >
> > > > > > > On Mon, Feb 17, 2020 at 11:28 PM Gary Li <yanjia.gary.li@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hello, I don't have any experience working with Athena,
> > > > > > > > but I can share my experience working with Impala. There
> > > > > > > > is a workaround. By setting the Hudi configs:
> > > > > > > >
> > > > > > > >    - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > > > > > >    - hoodie.cleaner.fileversions.retained=1
> > > > > > > >
> > > > > > > > you will have your Hudi dataset the same as plain parquet
> > > > > > > > files, and you can create a table over it just like
> > > > > > > > regular parquet. Hudi will write a new commit first, then
> > > > > > > > delete the older files that have two versions. You need to
> > > > > > > > refresh the table metadata store as soon as the Hudi
> > > > > > > > upsert job finishes. For Impala, it's simply REFRESH TABLE
> > > > > > > > xxx. Between Hudi vacuuming the older files and the table
> > > > > > > > metastore refresh, the table will be unavailable for query
> > > > > > > > (1-5 minutes in my case).
> > > > > > > >
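
Gary's workaround above, expressed as Spark datasource write options; a
sketch reusing the placeholder table and paths from the earlier snippet,
not a tested recipe:

    df.write.format("org.apache.hudi").
      option("hoodie.table.name", "my_hudi_table").
      option("hoodie.datasource.write.operation", "upsert").
      option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS").  // retain only the newest file version
      option("hoodie.cleaner.fileversions.retained", "1").           // so the dataset reads as plain parquet
      option("hoodie.datasource.write.recordkey.field", "id").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.partitionpath.field", "dt").
      mode(SaveMode.Append).
      save("s3://my-bucket/hudi/my_hudi_table")

    // immediately after the upsert finishes, in Impala:
    //   REFRESH my_hudi_table;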
> > > > > > > > How can we process S3 parquet files (hourly partitioned)
> > > > > > > > through Apache Hudi? Is there any streaming layer we need
> > > > > > > > to introduce?
> > > > > > > > -----------
> > > > > > > > Hudi DeltaStreamer supports parquet files. You can do a
> > > > > > > > bulkInsert for the first job, then use DeltaStreamer for
> > > > > > > > the upsert job.
> > > > > > > >
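
Gary's bulkInsert-then-upsert suggestion as a DeltaStreamer invocation; a
sketch with placeholder paths (flag names as in current Hudi; older 0.5.x
releases may call --table-type --storage-type instead):

    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      /usr/lib/hudi/hudi-utilities-bundle.jar \
      --table-type COPY_ON_WRITE \
      --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
      --source-ordering-field ts \
      --target-base-path s3://my-bucket/hudi/my_hudi_table \
      --target-table my_hudi_table \
      --op BULK_INSERT \
      --props s3://my-bucket/conf/parquet-source.properties

    # parquet-source.properties (assumed contents):
    #   hoodie.deltastreamer.source.dfs.root=s3://my-bucket/raw-parquet
    #   hoodie.datasource.write.recordkey.field=id
    #   hoodie.datasource.write.partitionpath.field=dt

Re-running the same command with --op UPSERT would cover the subsequent
incremental jobs.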
> > > > > > > > 3 - What should be the parquet file size and row group
> > > > > > > > size for better performance when querying a Hudi dataset?
> > > > > > > > ----------
> > > > > > > > That depends on the query engine you are using, and it
> > > > > > > > should be documented somewhere. For Impala, the optimal
> > > > > > > > size for query performance is 256MB, but a larger file
> > > > > > > > size will make upserts more expensive. The size I
> > > > > > > > personally choose is 100MB to 128MB.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Gary
> > > > > > > >
> > > > > > > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> > > > > > > > <ra...@amazon.com.invalid> wrote:
> > > > > > > >
> > > > > > > > > Athena is indeed Presto inside, but there is a lot of
> > > > > > > > > custom code on top of Presto there.
> > > > > > > > > A couple of months back I tried running a Glue crawler
> > > > > > > > > to catalog a Hudi dataset and then query it from Athena.
> > > > > > > > > The results were not the same as what I would get
> > > > > > > > > running the same query using Spark SQL on EMR. I did not
> > > > > > > > > try Presto on EMR, but I assume it will work fine there.
> > > > > > > > >
> > > > > > > > > Athena integration with Hudi datasets is planned
> > > > > > > > > shortly, but I'm not sure of the date yet.
> > > > > > > > >
> > > > > > > > > However, Athena recently started supporting integration
> > > > > > > > > with a Hive catalog apart from Glue. What that means is
> > > > > > > > > that in Athena, if I connect to the Hive catalog on EMR,
> > > > > > > > > which is able to provide the Hudi views correctly, I
> > > > > > > > > should be able to get correct results on Athena. I have
> > > > > > > > > not tested it, though. The feature is in Preview
> > > > > > > > > already.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Raghu
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Shiyan Xu <xu...@gmail.com>
> > > > > > > > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > > > > > > > To: dev@hudi.apache.org
> > > > > > > > > Cc: Mehrotra, Udit <ud...@amazon.com>; Raghvendra Dhar Dubey <ra...@delhivery.com.invalid>
> > > > > > > > > Subject: Re: Apache Hudi on AWS EMR
> > > > > > > > >
> > > > > > > > > For 2), I think running Presto on EMR is able to let you
> > > > > > > > > run read-optimized queries.
> > > > > > > > > I don't quite understand how exactly Athena does not
> > > > > > > > > support Hudi, as it is Presto underneath. Perhaps @Udit
> > > > > > > > > could give some insights from AWS?
> > > > > > > > >
> > > > > > > > > As @Raghvendra mentioned, another option is to export
> > > > > > > > > the Hudi dataset to plain parquet files for Athena to
> > > > > > > > > query. RFC-9 is for this use case:
> > > > > > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > > > > > > > The task is inactive now. Feel free to pick it up if
> > > > > > > > > this is something you'd like to work on. I'd be happy to
> > > > > > > > > help with that.
> > > > > > > > >
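
Pending that exporter, one ad-hoc way to get the "plain parquet for
Athena" effect is a small Spark job that reads the latest snapshot of the
Hudi table and rewrites it as vanilla parquet; a sketch with placeholder
paths, not an official tool:

    // read the latest snapshot (the /*/* glob walks the partition paths)
    val snapshot = spark.read.format("org.apache.hudi").
      load("s3://my-bucket/hudi/my_hudi_table/*/*")

    // drop Hudi's metadata columns and re-write as plain parquet for Athena
    snapshot.drop("_hoodie_commit_time", "_hoodie_commit_seqno",
                  "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name").
      write.mode("overwrite").parquet("s3://my-bucket/athena-export/my_hudi_table")
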
> > > > > > > > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <vinoth@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Raghvendra,
> > > > > > > > > >
> > > > > > > > > > Quick sidebar: please subscribe to the mailing list,
> > > > > > > > > > so your messages get published automatically. :)
> > > > > > > > > >
> > > > > > > > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > > > > > > > > > <ra...@delhivery.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Udit,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the information.
> > > > > > > > > > > Actually I am struggling with the following points:
> > > > > > > > > > > 1 - How can we process S3 parquet files (hourly
> > > > > > > > > > > partitioned) through Apache Hudi? Is there any
> > > > > > > > > > > streaming layer we need to introduce?
> > > > > > > > > > > 2 - Is there any workaround to query a Hudi dataset
> > > > > > > > > > > from Athena? We are thinking to dump the resulting
> > > > > > > > > > > Hudi dataset to S3 and then query it from Athena.
> > > > > > > > > > > 3 - What should be the parquet file size and row
> > > > > > > > > > > group size for better performance when querying a
> > > > > > > > > > > Hudi dataset?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Raghvendra
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <uditme@amazon.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Raghvendra,
> > > > > > > > > > > >
> > > > > > > > > > > > You would have to re-write your Parquet dataset
> > > > > > > > > > > > in Hudi format. Here are the links you can follow
> > > > > > > > > > > > to get started:
> > > > > > > > > > > >
> > > > > > > > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > > > > > > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Udit
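
The second link covers Hudi's incremental pull through the Spark
datasource, which is what the "incremental view" in the original question
maps to. A sketch (option keys as in Hudi 0.5.x; the begin instant is a
placeholder commit timestamp):

    // pull only records written after the given commit
    val incDf = spark.read.format("org.apache.hudi").
      option("hoodie.datasource.view.type", "incremental").
      option("hoodie.datasource.read.begin.instanttime", "20200212000000").
      load("s3://my-bucket/hudi/my_hudi_table")
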
> > > > > > > > > > > >
> > > > > > > > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > > > > > > > > > > <ra...@delhivery.com.INVALID> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Team,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I want to set up an incremental view of my AWS
> > > > > > > > > > > > > S3 parquet data through Apache Hudi, and I want
> > > > > > > > > > > > > to query this data through Athena, but Athena
> > > > > > > > > > > > > does not currently support Hudi datasets.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So there are a few questions I want to
> > > > > > > > > > > > > understand here:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1 - How to stream S3 parquet files to a Hudi
> > > > > > > > > > > > > dataset running on EMR.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2 - How to query a Hudi dataset running on EMR.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please help me to understand this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > Raghvendra

Re: Apache Hudi on AWS EMR

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Hi Udit,

I tried Hudi version 0.5.1 and it worked fine; this issue appeared with
Hudi 0.5.0. The other EMR-related issues have been discussed with Rahul.
Thanks to all of you for the cooperation.

Thanks
Raghvendra


Re: Apache Hudi on AWS EMR

Posted by "Mehrotra, Udit" <ud...@amazon.com.INVALID>.
Raghvendra,

Can you enable TRACE level logging for Hudi on EMR, and provide the error logs? To do this, go to /etc/spark/conf/log4j.properties and change the logging level of log4j.logger.org.apache.hudi to TRACE. This would help provide the failed records/keys, based off https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
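
Concretely, that is a one-line change in the log4j config (a sketch; the file on EMR will already contain other logger entries):

    # /etc/spark/conf/log4j.properties
    log4j.logger.org.apache.hudi=TRACE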

Another thing that would help is the Avro schema that gets printed on the driver when you run your job. We need to understand which field is treated as INT96, and why, because the current parquet-avro does not handle its conversion. Also, any other questions about EMR we can discuss in the meeting you have set up with Rahul from the EMR team.
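
For background (an aside, not from this thread): Spark historically writes timestamp columns to parquet as INT96, which is the type parquet-avro cannot convert. If the source files are produced by Spark 2.3+, one way to sidestep the issue upstream is to write timestamps as INT64 instead:

    // set before writing the source parquet files (Spark 2.3+)
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")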

Thanks,
Udit

Re: Apache Hudi on AWS EMR

Posted by Shiyan Xu <xu...@gmail.com>.
+1 on the idea. Giving a config like `--error-path`, where all failed
conversions are saved, provides flexibility for later processing. SQS/SNS
can pick that up later.
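
To illustrate the shape of the proposal (`--error-path` does not exist in
Hudi today; this is purely hypothetical):

    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      /usr/lib/hudi/hudi-utilities-bundle.jar \
      ... \
      --error-path s3://my-bucket/hudi-errors/my_hudi_table/
    # failed conversions would land under the error path as plain files,
    # which an SQS/SNS-triggered job could inspect and replay later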


Re: Apache Hudi on AWS EMR

Posted by Vinoth Chandar <vi...@apache.org>.
On the second part, it seems like a question for the EMR folks?

Hudi's RDD-level APIs do hand the failed records back, and maybe we should
consider writing out the error records somewhere for the datasource as
well? Others, any thoughts?
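
For reference, a rough sketch of what "hand the failed records back" looks
like against the RDD-level write client (class and method names as in Hudi
0.5.x, from memory; treat the exact signatures as approximate):

    import scala.collection.JavaConverters._
    // writeClient: org.apache.hudi.HoodieWriteClient
    // records: JavaRDD[HoodieRecord] prepared for the write
    val commitTime = writeClient.startCommit()
    val statuses = writeClient.upsert(records, commitTime)   // JavaRDD[WriteStatus]

    // each WriteStatus carries the records that failed in its file group
    val failed = statuses.rdd.filter(_.hasErrors).flatMap(_.getFailedRecords.asScala)
    failed.map(_.getKey.toString).saveAsTextFile("s3://my-bucket/hudi-errors/" + commitTime)  // hypothetical dump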

On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
<ra...@delhivery.com.invalid> wrote:

> Thanks Gary and Udit,
>
> I tried HudiDeltaStreamer for reading parquet files from s3  but there is
> an issue while AvroSchemaConverter not able to convert Parquet INT96. so I
> thought to use Spark Structured Streaming to read data from s3 and write
> into Hudi, but as Databricks providing "cloudfiles" for failure handling,
> Is there something in EMR? or do we need to manually handle this failure by
> introducing SQS and SNS?
>
>
>
> On 2020/02/18 20:03:16, "Mehrotra, Udit" <ud...@amazon.com.INVALID>
> wrote:
> > Workaround provided by Gary can help querying Hudi tables through Athena
> for Copy On Write tables by basically querying only the latest commit files
> as standard parquet. It would definitely be worth documenting, as several
> people have asked for it and I remember providing the same suggestion on
> slack earlier. I can add if I have the perms.
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide the
> >     Hudi views correctly, I should be able to get correct results on
> Athena
> >
> > As Vinoth mentioned, just connecting to metastore is not enough. Athena
> would still use its own Presto which does not support Hudi.
> >
> > As for Hudi support for Athena:
> > Athena does use Presto, but it's their own custom version and I don't
> think they yet have the code that Hudi guys contributed to presto i.e. the
> split annotations etc. Also they don’t have Hudi jars in presto classpath.
> We are not sure of any timelines for this support, but I have heard that
> work should start soon.
> >
> > Thanks,
> > Udit
> >
> > On 2/18/20, 11:27 AM, "Vinoth Chandar" <vi...@apache.org> wrote:
> >
> >     Thanks everyone for chiming in. Esp Gary for the detailed
> workaround..
> >     (should we FAQ this workaround.. food for thought)
> >
> >     >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> >     Hudi views correctly, I should be able to get correct results on
> Athena
> >
> >     Knowing how the Presto/Hudi integration works, simply being able to
> read
> >     from Hive metastore is not enough. Presto has code to specially
> recognize
> >     Hudi tables and does an additional filtering step, which lets it
> query the
> >     data in there correctly. (Gary's workaround above keeps just 1
> version
> >     around for a given file (group))..
> >
> >     On Mon, Feb 17, 2020 at 11:28 PM Gary Li <ya...@gmail.com>
> wrote:
> >
> >     > Hello, I don't have any experience working with Athena but I can
> share my
> >     > experience working with Impala. There is a workaround.
> >     > By setting Hudi config:
> >     >
> >     >    - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> >     >    - hoodie.cleaner.fileversions.retained=1
> >     >
> >     > You will have your Hudi dataset as same as plain parquet files.
> You can
> >     > create a table just like regular parquet. Hudi will write a new
> commit
> >     > first then delete the older files that have two versions. You need
> to
> >     > refresh the table metadata store as soon as the Hudi Upsert job
> finishes.
> >     > For impala, it's simply REFRESH TABLE xxx. After Hudi vacuumed the
> older
> >     > files and before refresh the table metastore, the table will be
> unavailable
> >     > for query(1-5 mins in my case).
> >     >
> >     > How can we process S3 parquet files(hourly partitioned) through
> Apache
> >     > Hudi? Is there any streaming layer we need to introduce?
> >     > -----------
> >     > Hudi Delta streamer support parquet file. You can do a bulkInsert
> for the
> >     > first job then use delta streamer for the Upsert job.
> >     >
> >     > 3 - What should be the parquet file size and row group size for
> better
> >     > performance on querying Hudi Dataset?
> >     > ----------
> >     > That depends on the query engine you are using and it should be
> documented
> >     > somewhere. For impala, the optimal size for query performance is
> 256MB, but
> >     > the larger file size will make upsert more expensive. The size I
> personally
> >     > choose is 100MB to 128MB.
> >     >
> >     > Thanks,
> >     > Gary
> >     >
> >     >
> >     >
> >     > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> <ra...@amazon.com.invalid>
> >     > wrote:
> >     >
> >     > > Athena is indeed Presto inside, but there is a lot of custom code
> >     > > which has gone on top of Presto there.
> >     > > A couple of months back I tried running a Glue crawler to catalog a
> >     > > Hudi data set and then query it from Athena. The results were not
> >     > > the same as what I would get running the same query using Spark SQL
> >     > > on EMR. I did not try Presto on EMR, but am assuming it will work
> >     > > fine there.
> >     > >
> >     > > Athena integration with Hudi data sets is planned shortly, but I am
> >     > > not sure of the date yet.
> >     > >
> >     > > However, Athena recently started supporting integration with a Hive
> >     > > catalog apart from Glue. What that means is that if, in Athena, I
> >     > > connect to the Hive catalog on EMR, which is able to provide the
> >     > > Hudi views correctly, I should be able to get correct results on
> >     > > Athena. I have not tested it though. The feature is already in
> >     > > Preview.
> >     > >
> >     > > Thanks
> >     > > Raghu
> >     > > -----Original Message-----
> >     > > From: Shiyan Xu <xu...@gmail.com>
> >     > > Sent: Tuesday, February 18, 2020 6:20 AM
> >     > > To: dev@hudi.apache.org
> >     > > Cc: Mehrotra, Udit <ud...@amazon.com>; Raghvendra Dhar Dubey
> >     > > <ra...@delhivery.com.invalid>
> >     > > Subject: Re: Apache Hudi on AWS EMR
> >     > >
> >     > > For 2) I think running Presto on EMR lets you run read-optimized
> >     > > queries.
> >     > > I don't quite understand how exactly Athena does not support Hudi,
> >     > > as it is Presto underneath.
> >     > > Perhaps @Udit could give some insights from AWS?
> >     > >
> >     > > As you mentioned, @Raghvendra, another option is to export the Hudi
> >     > > dataset to plain parquet files for Athena to query on.
> >     > > RFC-9 is for this use case:
> >     > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> >     > > The task is inactive now. Feel free to pick it up if this is
> >     > > something you'd like to work on. I'd be happy to help with that.
> >     > >
> >     > >
> >     > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <
> vinoth@apache.org>
> >     > wrote:
> >     > >
> >     > > > Hi Raghvendra,
> >     > > >
> >     > > > Quick sidebar: please subscribe to the mailing list, so your
> >     > > > messages get published automatically. :)
> >     > > >
> >     > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> >     > > > <ra...@delhivery.com.invalid> wrote:
> >     > > >
> >     > > > > Hi Udit,
> >     > > > >
> >     > > > > Thanks for the information.
> >     > > > > Actually, I am struggling with the following points:
> >     > > > > 1 - How can we process S3 parquet files (hourly partitioned)
> >     > > > > through Apache Hudi? Is there any streaming layer we need to
> >     > > > > introduce?
> >     > > > > 2 - Is there any workaround to query a Hudi Dataset from Athena?
> >     > > > > We are thinking of dumping the resulting Hudi dataset to S3 and
> >     > > > > then querying it from Athena.
> >     > > > > 3 - What should be the parquet file size and row group size for
> >     > > > > better performance when querying a Hudi Dataset?
> >     > > > >
> >     > > > > Thanks
> >     > > > > Raghvendra
> >     > > > >
> >     > > > >
> >     > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <
> uditme@amazon.com>
> >     > > > wrote:
> >     > > > >
> >     > > > > > Hi Raghvendra,
> >     > > > > >
> >     > > > > > You would have to re-write your Parquet Dataset in Hudi
> >     > > > > > format. Here are the links you can follow to get started:
> >     > > > > >
> >     > > > > > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> >     > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> >     > > > > >
> >     > > > > > Thanks,
> >     > > > > > Udit
> >     > > > > >
> >     > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> >     > > > > > <ra...@delhivery.com.INVALID> wrote:
> >     > > > > >
> >     > > > > >     Hi Team,
> >     > > > > >
> >     > > > > >     I want to setup incremental view of my AWS S3 parquet
> data
> >     > > > > > through Apache
> >     > > > > >     Hudi, and want to query this data through Athena, but
> >     > > > > > currently
> >     > > > > Athena
> >     > > > > > not
> >     > > > > >     supporting Hudi Dataset.
> >     > > > > >
> >     > > > > >     so there are few questions which I want to understand
> here
> >     > > > > >
> >     > > > > >     1 - How to stream s3 parquet file to Hudi dataset
> running on
> >     > EMR.
> >     > > > > >
> >     > > > > >     2 - How to query Hudi Dataset running on EMR
> >     > > > > >
> >     > > > > >     Please help me to understand this.
> >     > > > > >
> >     > > > > >     Thanks
> >     > > > > >
> >     > > > > >     Raghvendra
> >     > > > > >
> >     > > > > >
> >     > > > > >
> >     > > > >
> >     > > >
> >     > >
> >     >
> >
> >
> >
>

Re: Apache Hudi on AWS EMR

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Thanks Gary and Udit,

I tried HoodieDeltaStreamer for reading parquet files from S3, but there is an issue: the AvroSchemaConverter is not able to convert Parquet INT96. So I thought of using Spark Structured Streaming to read the data from S3 and write it into Hudi. But where Databricks provides "cloudFiles" for failure handling, is there something similar in EMR, or do we need to handle failures manually by introducing SQS and SNS?
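
For reference, a minimal sketch of that Structured Streaming approach with a hand-rolled error path (bucket names, field names and the error handling are illustrative assumptions, not a tested EMR recipe):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("s3-parquet-to-hudi").getOrCreate()

    // Structured Streaming needs an explicit schema; borrow it from existing files.
    val schema = spark.read.parquet("s3://my-bucket/raw/").schema

    spark.readStream
      .schema(schema)
      .parquet("s3://my-bucket/raw/")
      .writeStream
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        try {
          batch.write
            .format("org.apache.hudi")
            .option("hoodie.table.name", "my_table")
            .option("hoodie.datasource.write.recordkey.field", "id")
            .option("hoodie.datasource.write.precombine.field", "updated_at")
            .option("hoodie.datasource.write.partitionpath.field", "dt")
            .option("hoodie.datasource.write.operation", "upsert")
            .mode("append")
            .save("s3://my-bucket/hudi/my_table")
        } catch {
          // Crude stand-in for "cloudFiles"-style failure handling: park the
          // failed batch for inspection/replay instead of dropping it.
          case _: Exception =>
            batch.write.mode("append").parquet(s"s3://my-bucket/errors/batch_$batchId")
        }
      }
      .start()
      .awaitTermination()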




Re: Apache Hudi on AWS EMR

Posted by Bhavani Sudha Saktheeswaran <bh...@uber.com.INVALID>.
Got it. Thanks Udit!


Re: Apache Hudi on AWS EMR

Posted by "Mehrotra, Udit" <ud...@amazon.com.INVALID>.
Hi Sudha,

Yes, EMR Presto since the 5.28.0 release comes with the Hudi jar present in the Presto classpath. If you launch a cluster with Presto you should see it at:

/usr/lib/presto/plugin/hive-hadoop2/hudi-presto-bundle.jar

Thanks,
Udit




Re: Apache Hudi on AWS EMR

Posted by Bhavani Sudha <bh...@gmail.com>.
Hi Udit,

Just a quick question on EMR Presto. Does EMR Presto ship the Hudi jars in
its classpath?


Re: Apache Hudi on AWS EMR

Posted by "Mehrotra, Udit" <ud...@amazon.com.INVALID>.
The workaround provided by Gary can help query Copy On Write Hudi tables through Athena, by basically querying only the latest commit's files as standard parquet. It would definitely be worth documenting, as several people have asked for it, and I remember providing the same suggestion on Slack earlier. I can add it if I have the perms.

>> if I connect to the Hive catalog on EMR, which is able to provide the
    Hudi views correctly, I should be able to get correct results on Athena

As Vinoth mentioned, just connecting to the metastore is not enough. Athena would still use its own Presto, which does not support Hudi.

As for Hudi support in Athena:
Athena does use Presto, but it's their own custom version, and I don't think they yet have the code that the Hudi folks contributed to Presto, i.e. the split annotations etc. They also don't have the Hudi jars in the Presto classpath. We are not sure of any timelines for this support, but I have heard that work should start soon.

Thanks,
Udit



Re: Apache Hudi on AWS EMR

Posted by Vinoth Chandar <vi...@apache.org>.
Thanks everyone for chiming in, especially Gary for the detailed workaround.
(Should we FAQ this workaround? Food for thought.)

>> if I connect to the Hive catalog on EMR, which is able to provide the
Hudi views correctly, I should be able to get correct results on Athena

Knowing how the Presto/Hudi integration works, simply being able to read
from Hive metastore is not enough. Presto has code to specially recognize
Hudi tables and does an additional filtering step, which lets it query the
data in there correctly. (Gary's workaround above keeps just 1 version
around for a given file (group))..


Re: Apache Hudi on AWS EMR

Posted by Gary Li <ya...@gmail.com>.
Hello, I don't have any experience working with Athena, but I can share my
experience working with Impala. There is a workaround.
By setting the Hudi configs:

   - hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
   - hoodie.cleaner.fileversions.retained=1

your Hudi dataset will look the same as plain parquet files, and you can
create a table over it just like regular parquet. Hudi writes the new commit
first and then deletes the older of the two file versions. You need to
refresh the table metadata store as soon as the Hudi upsert job finishes;
for Impala, it's simply REFRESH TABLE xxx. Between the time Hudi cleans the
older files and the metastore refresh completes, the table will be
unavailable for query (1-5 mins in my case).
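
As a minimal sketch of wiring those cleaner settings into a Spark datasource
write (the table name, paths, and key fields below are illustrative
assumptions, not a tested job):

    // spark-shell (Scala): upsert with aggressive cleaning so only the
    // latest version of each file is kept
    import org.apache.spark.sql.SaveMode

    val updates = spark.read.parquet("s3://my-bucket/incoming/2020/02/18/")

    updates.write.format("org.apache.hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS")
      .option("hoodie.cleaner.fileversions.retained", "1")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/hudi/my_table")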

How can we process S3 parquet files (hourly partitioned) through Apache
Hudi? Is there any streaming layer we need to introduce?
-----------
Hudi DeltaStreamer supports parquet files. You can do a bulk insert for the
first job and then use DeltaStreamer for the upsert jobs.
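
A rough spark-submit sketch of that flow (the jar path, S3 locations, and
field names here are assumptions for illustration, not tested commands):

    # First run: bulk insert the historical parquet data
    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      /usr/lib/hudi/hudi-utilities-bundle.jar \
      --table-type COPY_ON_WRITE \
      --op BULK_INSERT \
      --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
      --source-ordering-field updated_at \
      --target-base-path s3://my-bucket/hudi/my_table \
      --target-table my_table \
      --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://my-bucket/incoming/

    # Later runs: the same command with --op UPSERT picks up newly arrived files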

3 - What should be the parquet file size and row group size for better
performance on querying Hudi Dataset?
----------
That depends on the query engine you are using, and it should be documented
somewhere. For Impala, the optimal size for query performance is 256MB, but
a larger file size makes upserts more expensive. The size I personally
choose is 100MB to 128MB.
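
If it helps, the write-side knobs for this are the parquet sizing configs,
added to a Hudi writer like the sketch above (the values below just
illustrate the 100-128MB range, not a recommendation; all sizes are bytes):

    // extra options for the Hudi writer shown earlier
    .option("hoodie.parquet.max.file.size", (128 * 1024 * 1024).toString)    // target data file size
    .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // small-file threshold for routing inserts
    .option("hoodie.parquet.block.size", (128 * 1024 * 1024).toString)       // parquet row group size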

Thanks,
Gary



On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu <ra...@amazon.com.invalid>
wrote:

> Athena is indeed Presto inside, but there is lot of custom code which has
> gone on top of Presto there.
> Couple months back I tried running a glue crawler to catalog a Hudi data
> set and then query it from Athena. The results were not same as what I
> would get with running the same query using spark SQL on EMR. Did not try
> Presto on EMR, but assuming it will work fine on EMR.
>
> Athena integration with Hudi data set is planned shortly, but not sure of
> the date yet.
>
> However, recently Athena started supporting integration to a Hive catalog
> apart from Glue. What that means is in Athena, if I connect to the Hive
> catalog on EMR, which is able to provide the Hudi views correctly, I should
> be able to get correct results on Athena. Have not tested it though. The
> feature is in Preview already.
>
> Thanks
> Raghu
> -----Original Message-----
> From: Shiyan Xu <xu...@gmail.com>
> Sent: Tuesday, February 18, 2020 6:20 AM
> To: dev@hudi.apache.org
> Cc: Mehrotra, Udit <ud...@amazon.com>; Raghvendra Dhar Dubey
> <ra...@delhivery.com.invalid>
> Subject: Re: Apache Hudi on AWS EMR
>
> For 2) I think running presto on EMR is able to let you run read-optimized
> queries.
> I don't quite understand how exactly Athena not support Hudi as it is
> Presto underlying.
> Perhaps @Udit could give some insights from AWS?
>
> As @Raghvendra you mentioned, another option is to export Hudi dataset to
> plain parquet files for Athena to query on
> RFC-9 is for this usecase
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> The task is inactive now. Feel free to pick up if this is something you'd
> like to work on. I'd be happy to help with that.
>
>
> On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <vi...@apache.org> wrote:
>
> > Hi Raghvendra,
> >
> > Quick sidebar.. Please subscribe to the mailing list, so your message
> > get published automatically. :)
> >
> > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > <ra...@delhivery.com.invalid> wrote:
> >
> > > Hi Udit,
> > >
> > > Thanks for information.
> > > Actually I am struggling on following points
> > > 1 - How can we process S3 parquet files(hourly partitioned) through
> > Apache
> > > Hudi? Is there any streaming layer we need to introduce? 2 - Is
> > > there any workaround to query Hudi Dataset from Athena? we are
> > > thinking to dump resulting Hudi dataset to S3, and then querying
> > > from Athena. 3 - What should be the parquet file size and row group
> > > size for better performance on querying Hudi Dataset?
> > >
> > > Thanks
> > > Raghvendra
> > >
> > >
> > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <ud...@amazon.com>
> > wrote:
> > >
> > > > Hi Raghvendra,
> > > >
> > > > You would have to re-write you Parquet Dataset in Hudi format.
> > > > Here are the links you can follow to get started:
> > > >
> > > >
> > >
> > https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with
> > -dataset.html
> > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > >
> > > > Thanks,
> > > > Udit
> > > >
> > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > > <ra...@delhivery.com.INVALID> wrote:
> > > >
> > > >     Hi Team,
> > > >
> > > >     I want to setup incremental view of my AWS S3 parquet data
> > > > through Apache
> > > >     Hudi, and want to query this data through Athena, but
> > > > currently
> > > Athena
> > > > not
> > > >     supporting Hudi Dataset.
> > > >
> > > >     so there are few questions which I want to understand here
> > > >
> > > >     1 - How to stream s3 parquet file to Hudi dataset running on EMR.
> > > >
> > > >     2 - How to query Hudi Dataset running on EMR
> > > >
> > > >     Please help me to understand this.
> > > >
> > > >     Thanks
> > > >
> > > >     Raghvendra
> > > >
> > > >
> > > >
> > >
> >
>

RE: Apache Hudi on AWS EMR

Posted by "Dubey, Raghu" <ra...@amazon.com.INVALID>.
Athena is indeed Presto inside, but there is a lot of custom code that has gone on top of Presto there.
A couple of months back I tried running a Glue crawler to catalog a Hudi data set and then query it from Athena. The results were not the same as what I would get running the same query using Spark SQL on EMR. I did not try Presto on EMR, but I assume it will work fine there.

Athena integration with Hudi data sets is planned shortly, but I am not sure of the date yet.

However, Athena recently started supporting integration with a Hive catalog apart from Glue. What that means is that if, in Athena, I connect to the Hive catalog on EMR, which is able to provide the Hudi views correctly, I should be able to get correct results in Athena. I have not tested it, though. The feature is already in Preview.

Thanks
Raghu
-----Original Message-----
From: Shiyan Xu <xu...@gmail.com> 
Sent: Tuesday, February 18, 2020 6:20 AM
To: dev@hudi.apache.org
Cc: Mehrotra, Udit <ud...@amazon.com>; Raghvendra Dhar Dubey <ra...@delhivery.com.invalid>
Subject: Re: Apache Hudi on AWS EMR

For 2) I think running presto on EMR is able to let you run read-optimized queries.
I don't quite understand how exactly Athena not support Hudi as it is Presto underlying.
Perhaps @Udit could give some insights from AWS?

As @Raghvendra you mentioned, another option is to export Hudi dataset to plain parquet files for Athena to query on
RFC-9 is for this usecase
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
The task is inactive now. Feel free to pick up if this is something you'd like to work on. I'd be happy to help with that.


On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Raghvendra,
>
> Quick sidebar.. Please subscribe to the mailing list, so your message 
> get published automatically. :)
>
> On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey 
> <ra...@delhivery.com.invalid> wrote:
>
> > Hi Udit,
> >
> > Thanks for information.
> > Actually I am struggling on following points
> > 1 - How can we process S3 parquet files(hourly partitioned) through
> Apache
> > Hudi? Is there any streaming layer we need to introduce? 2 - Is 
> > there any workaround to query Hudi Dataset from Athena? we are 
> > thinking to dump resulting Hudi dataset to S3, and then querying 
> > from Athena. 3 - What should be the parquet file size and row group 
> > size for better performance on querying Hudi Dataset?
> >
> > Thanks
> > Raghvendra
> >
> >
> > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <ud...@amazon.com>
> wrote:
> >
> > > Hi Raghvendra,
> > >
> > > You would have to re-write you Parquet Dataset in Hudi format. 
> > > Here are the links you can follow to get started:
> > >
> > >
> >
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with
> -dataset.html
> > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > <ra...@delhivery.com.INVALID> wrote:
> > >
> > >     Hi Team,
> > >
> > >     I want to setup incremental view of my AWS S3 parquet data 
> > > through Apache
> > >     Hudi, and want to query this data through Athena, but 
> > > currently
> > Athena
> > > not
> > >     supporting Hudi Dataset.
> > >
> > >     so there are few questions which I want to understand here
> > >
> > >     1 - How to stream s3 parquet file to Hudi dataset running on EMR.
> > >
> > >     2 - How to query Hudi Dataset running on EMR
> > >
> > >     Please help me to understand this.
> > >
> > >     Thanks
> > >
> > >     Raghvendra
> > >
> > >
> > >
> >
>

Re: Apache Hudi on AWS EMR

Posted by Shiyan Xu <xu...@gmail.com>.
For 2), I think running Presto on EMR lets you run read-optimized queries.
I don't quite understand how exactly Athena does not support Hudi, since it
is Presto underneath.
Perhaps @Udit could give some insights from AWS?

As you mentioned, @Raghvendra, another option is to export the Hudi dataset
to plain parquet files for Athena to query.
RFC-9 covers this use case:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
The task is inactive now. Feel free to pick it up if this is something you'd
like to work on. I'd be happy to help with that.


On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar <vi...@apache.org> wrote:

> Hi Raghvendra,
>
> Quick sidebar.. Please subscribe to the mailing list, so your message get
> published automatically. :)
>
> On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> <ra...@delhivery.com.invalid> wrote:
>
> > Hi Udit,
> >
> > Thanks for information.
> > Actually I am struggling on following points
> > 1 - How can we process S3 parquet files(hourly partitioned) through
> Apache
> > Hudi? Is there any streaming layer we need to introduce? 2 - Is there any
> > workaround to query Hudi Dataset from Athena? we are thinking to dump
> > resulting Hudi dataset to S3, and then querying from Athena. 3 - What
> > should be the parquet file size and row group size for better performance
> > on querying Hudi Dataset?
> >
> > Thanks
> > Raghvendra
> >
> >
> > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <ud...@amazon.com>
> wrote:
> >
> > > Hi Raghvendra,
> > >
> > > You would have to re-write you Parquet Dataset in Hudi format. Here are
> > > the links you can follow to get started:
> > >
> > >
> >
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > <ra...@delhivery.com.INVALID> wrote:
> > >
> > >     Hi Team,
> > >
> > >     I want to setup incremental view of my AWS S3 parquet data through
> > > Apache
> > >     Hudi, and want to query this data through Athena, but currently
> > Athena
> > > not
> > >     supporting Hudi Dataset.
> > >
> > >     so there are few questions which I want to understand here
> > >
> > >     1 - How to stream s3 parquet file to Hudi dataset running on EMR.
> > >
> > >     2 - How to query Hudi Dataset running on EMR
> > >
> > >     Please help me to understand this.
> > >
> > >     Thanks
> > >
> > >     Raghvendra
> > >
> > >
> > >
> >
>

Re: Apache Hudi on AWS EMR

Posted by Vinoth Chandar <vi...@apache.org>.
Hi Raghvendra,

Quick sidebar: please subscribe to the mailing list, so your messages get
published automatically. :)

On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
<ra...@delhivery.com.invalid> wrote:

> Hi Udit,
>
> Thanks for information.
> Actually I am struggling on following points
> 1 - How can we process S3 parquet files(hourly partitioned) through Apache
> Hudi? Is there any streaming layer we need to introduce? 2 - Is there any
> workaround to query Hudi Dataset from Athena? we are thinking to dump
> resulting Hudi dataset to S3, and then querying from Athena. 3 - What
> should be the parquet file size and row group size for better performance
> on querying Hudi Dataset?
>
> Thanks
> Raghvendra
>
>
> On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <ud...@amazon.com> wrote:
>
> > Hi Raghvendra,
> >
> > You would have to re-write you Parquet Dataset in Hudi format. Here are
> > the links you can follow to get started:
> >
> >
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> >
> > Thanks,
> > Udit
> >
> > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > <ra...@delhivery.com.INVALID> wrote:
> >
> >     Hi Team,
> >
> >     I want to setup incremental view of my AWS S3 parquet data through
> > Apache
> >     Hudi, and want to query this data through Athena, but currently
> Athena
> > not
> >     supporting Hudi Dataset.
> >
> >     so there are few questions which I want to understand here
> >
> >     1 - How to stream s3 parquet file to Hudi dataset running on EMR.
> >
> >     2 - How to query Hudi Dataset running on EMR
> >
> >     Please help me to understand this.
> >
> >     Thanks
> >
> >     Raghvendra
> >
> >
> >
>

Re: Apache Hudi on AWS EMR

Posted by Raghvendra Dhar Dubey <ra...@delhivery.com.INVALID>.
Hi Udit,

Thanks for the information.
Actually I am struggling with the following points:
1 - How can we process S3 parquet files (hourly partitioned) through Apache
Hudi? Is there any streaming layer we need to introduce?
2 - Is there any workaround to query a Hudi dataset from Athena? We are
thinking of dumping the resulting Hudi dataset to S3 and then querying it
from Athena.
3 - What should the parquet file size and row group size be for better
performance when querying a Hudi dataset?

Thanks
Raghvendra


On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit <ud...@amazon.com> wrote:

> Hi Raghvendra,
>
> You would have to re-write you Parquet Dataset in Hudi format. Here are
> the links you can follow to get started:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
>
> Thanks,
> Udit
>
> On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> <ra...@delhivery.com.INVALID> wrote:
>
>     Hi Team,
>
>     I want to setup incremental view of my AWS S3 parquet data through
> Apache
>     Hudi, and want to query this data through Athena, but currently Athena
> not
>     supporting Hudi Dataset.
>
>     so there are few questions which I want to understand here
>
>     1 - How to stream s3 parquet file to Hudi dataset running on EMR.
>
>     2 - How to query Hudi Dataset running on EMR
>
>     Please help me to understand this.
>
>     Thanks
>
>     Raghvendra
>
>
>

Re: Apache Hudi on AWS EMR

Posted by "Mehrotra, Udit" <ud...@amazon.com.INVALID>.
Hi Raghvendra,

You would have to re-write your Parquet Dataset in Hudi format. Here are the links you can follow to get started:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
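
To make that concrete, a minimal sketch of both steps (the table name,
paths, key fields, and begin instant below are illustrative assumptions;
option names are per the Hudi 0.5.x datasource):

    // One-time re-write of the existing parquet data as a Hudi table
    val df = spark.read.parquet("s3://my-bucket/existing-parquet/")
    df.write.format("org.apache.hudi")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      .option("hoodie.table.name", "my_table")
      .mode("overwrite")
      .save("s3://my-bucket/hudi/my_table")

    // Spark incremental pull: only records committed after the given instant
    val incremental = spark.read.format("org.apache.hudi")
      .option("hoodie.datasource.view.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20200212000000")
      .load("s3://my-bucket/hudi/my_table")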

Thanks,
Udit

On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey" <ra...@delhivery.com.INVALID> wrote:

    Hi Team,
    
    I want to setup incremental view of my AWS S3 parquet data through Apache
    Hudi, and want to query this data through Athena, but currently Athena not
    supporting Hudi Dataset.
    
    so there are few questions which I want to understand here
    
    1 - How to stream s3 parquet file to Hudi dataset running on EMR.
    
    2 - How to query Hudi Dataset running on EMR
    
    Please help me to understand this.
    
    Thanks
    
    Raghvendra