Posted to users@hudi.apache.org by Sid Kal <fl...@gmail.com> on 2022/02/20 19:49:01 UTC

Fwd: How CDC works especially in delete case

We have a use case for which we were planning to use Hudi tables for CDC
purposes. Basically, my whole intention is to perform upserts along with
deletes: if a record is deleted in my source system, it should be deleted
from my target as well.
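For context, the end state being asked about can be sketched in plain Python. This is only an illustration of the intended merge semantics, not how Hudi implements them; the function, keys, and sample rows are made up:

```python
# Apply a CDC batch (inserts, updates, deletes) to a target keyed by record key.
# Plain-Python sketch of the desired semantics, not Hudi's implementation.

def apply_cdc_batch(target, batch):
    """target: dict mapping record key -> row dict.
    batch: list of (op, key, row) tuples where op is 'I', 'U', or 'D'."""
    for op, key, row in batch:
        if op in ("I", "U"):
            target[key] = row      # upsert: insert new key or overwrite existing
        elif op == "D":
            target.pop(key, None)  # delete: drop the key if present
    return target

target = {1: {"name": "a"}, 2: {"name": "b"}}
batch = [("U", 1, {"name": "a2"}), ("D", 2, None), ("I", 3, {"name": "c"})]
apply_cdc_batch(target, batch)
# target is now {1: {"name": "a2"}, 3: {"name": "c"}}
```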

I went through this link, where a user performs CDC using Hudi.
https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b

My question is: how does Hudi internally recognize the records in an
incremental data load? How should the incremental file be structured so that
we can tell which records are meant to be inserted, updated, or deleted?

I am actually confused with this part:

S3_INCR_RAW_DATA = "s3://aws-analytics-course/raw/dms/fossil/coal_prod/20200808-*.csv"
df_coal_prod_incr = spark.read.csv(S3_INCR_RAW_DATA, header=False, schema=coal_prod_schema)
df_coal_prod_incr_u_i = df_coal_prod_incr.filter("Mode IN ('U', 'I')")

Here the user filters directly on Mode. Is "Mode" a column inside the
dataset, or where does it come from?
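(For reference: in AWS DMS CDC output each row carries an operation
indicator, commonly the first column, with values 'I'/'U'/'D'; the blog's
user-defined schema names that column "Mode". A plain-Python sketch of the
same split the blog does in Spark; the sample rows below are invented:)

```python
# Split a DMS-style CDC batch by its operation column ("Mode" in the blog).
rows = [
    {"Mode": "I", "id": 1, "production": 100},
    {"Mode": "U", "id": 2, "production": 250},
    {"Mode": "D", "id": 3, "production": None},
]

upserts = [r for r in rows if r["Mode"] in ("U", "I")]  # sent to Hudi as upserts
deletes = [r for r in rows if r["Mode"] == "D"]         # sent to Hudi as deletes

# upserts contains ids 1 and 2; deletes contains id 3
```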

I am a newbie to Hudi.

Thanks,
Sid

Re: How CDC works especially in delete case

Posted by Sid <fl...@gmail.com>.
Hi,

Can someone help me to understand this ?

Thanks,
Sid


Re: How CDC works especially in delete case

Posted by Sid Kal <fl...@gmail.com>.
Hi Danny,

Thank you for your response.

That file is the one used by the blog author. In my actual dataset, I have a
schema like customerid, customername, effective_date, customer_mob. In this
case, how will Hudi manage CDC?
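(One common way to carry deletes with a schema like this is a soft-delete
flag alongside the business columns; Hudi, for example, recognizes a boolean
`_hoodie_is_deleted` field on upserted records. A plain-Python sketch; the
sample data and the `mode` field are invented:)

```python
# Map a CDC change row to a record carrying an explicit delete flag.
def to_hudi_record(change):
    return {
        "customerid": change["customerid"],
        "customername": change["customername"],
        "effective_date": change["effective_date"],
        "customer_mob": change["customer_mob"],
        # True marks the record for deletion when the batch is upserted
        "_hoodie_is_deleted": change["mode"] == "D",
    }

rec = to_hudi_record({"mode": "D", "customerid": 7, "customername": "x",
                      "effective_date": "2022-02-21", "customer_mob": "999"})
# rec["_hoodie_is_deleted"] is True
```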

Thanks,
Sid


Re: How CDC works especially in delete case

Posted by Danny Chan <da...@apache.org>.
Hello, what is the schema of the file being read (S3_INCR_RAW_DATA)?

Best,
Danny


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@hudi.apache.org
For additional commands, e-mail: users-help@hudi.apache.org