Posted to user@kylin.apache.org by Nam Đỗ Duy via user <us...@kylin.apache.org> on 2023/12/12 04:10:34 UTC

ACID with Hive/Kylin

Dear Xiaoxiang, Sirs/Madams

I am facing an issue with deleting user data under a GDPR-like policy:
when a user sends a request to delete their personal data, we need to
delete it from every system, which means deleting the data:

1- from Kylin index (cube)
2- from Hive
3- from HDFS

Have you had the same use case before? Do you have any suggestions for
handling this scenario?

Thank you very much and best regards

Re: ACID with Hive/Kylin

Posted by Nam Đỗ Duy <na...@vnpay.vn.INVALID>.

Thank you both for your valuable information. I will test and get back to
you soon.

Best regards



Re: ACID with Hive/Kylin

Posted by Xiaoxiang Yu <xx...@apache.org>.

I don't know GDPR very well. Here is my understanding.

For Hive and HDFS, you can consider using one of these techniques, which
support ACID in Spark and Hive (I recommend the first one):
1) Delta Lake,
https://docs.databricks.com/en/security/privacy/gdpr-delta.html
2) Hive ACID table, here is a link,
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/migrate-hive-workloads/topics/hive-acid-migration-regulations.html
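
As a rough sketch of what either option involves: in both engines a
row-level DELETE is only a logical removal, and the old data files stay on
HDFS until a cleanup step runs (VACUUM for Delta Lake, once the files are
older than the retention period; a major compaction for Hive ACID tables).
The Python helper below only builds those statements. The function name,
the table name dw.dim_user, and the integer uid column are assumptions made
for illustration; they are not part of any Hive, Delta Lake, or Kylin API.

```python
import re

# Accept only plain "table" or "db.table" identifiers (illustrative guard).
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def erase_statements(table, uid, engine):
    """Build the SQL needed to erase one user's rows (hypothetical helper).

    engine is "delta" (Delta Lake via Spark SQL) or "hive_acid"
    (a transactional Hive table). The uid column name is an assumption.
    """
    if not _IDENT.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if isinstance(uid, bool) or not isinstance(uid, int):
        raise ValueError("uid must be an integer in this sketch")

    # Step 1: logical delete. Readers stop seeing the rows, but the old
    # data files are still present on storage.
    delete = f"DELETE FROM {table} WHERE uid = {uid}"

    if engine == "delta":
        # Step 2 (Delta Lake): VACUUM removes files that are no longer
        # referenced and are older than the retention period.
        return [delete, f"VACUUM {table}"]
    if engine == "hive_acid":
        # Step 2 (Hive ACID): a major compaction rewrites the base files
        # without the deleted rows.
        return [delete, f"ALTER TABLE {table} COMPACT 'major'"]
    raise ValueError(f"unknown engine: {engine!r}")


for stmt in erase_statements("dw.dim_user", 42, "delta"):
    print(stmt)
```

The statements would be run through Spark SQL or Hive respectively; when
the physical cleanup actually happens depends on the retention and
compaction settings of your cluster.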

For Kylin, there are three places which may store data: index, snapshot,
and dict. Refreshing a snapshot costs less time and fewer resources, while
refreshing the index/dict costs much more. A snapshot refresh is triggered
automatically when you build an index every day.

I think you should consider centralizing user-sensitive columns (email,
phone, address) in dimension tables, so that your fact table only has the
foreign key (for example, uid) which refers to the primary key of the
dimension tables.
When you are modeling in Kylin, for the dim tables which contain
user-sensitive columns, try to:

1. set the dim tables as snapshots by disabling the precomputed join
relation, so these columns won't be built into indexes; refer to
https://kylin.apache.org/5.0/docs/modeling/model_design/precompute_join_relations
2. not create a bitmap measure on these columns, so these columns won't be
built into the dict
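
The modeling idea above can be illustrated with a small, self-contained
sketch. It uses Python's built-in sqlite3 purely as a stand-in for the Hive
source tables; it does not touch Kylin. The table names dim_user and
fact_payment and all column names are hypothetical. The point is that when
the sensitive columns live only in the dimension table, erasing a user is a
single delete there, and aggregates over the fact table are unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sensitive columns are kept only in the dimension table.
    CREATE TABLE dim_user (uid INTEGER PRIMARY KEY, email TEXT, phone TEXT);
    -- The fact table carries only the foreign key and the measures.
    CREATE TABLE fact_payment (uid INTEGER, dt TEXT, amount REAL);

    INSERT INTO dim_user VALUES
        (1, 'a@example.com', '111'),
        (2, 'b@example.com', '222');
    INSERT INTO fact_payment VALUES
        (1, '2023-12-01', 10.0),
        (2, '2023-12-01', 5.0),
        (1, '2023-12-02', 7.5);
    """
)


def erase_user(connection, uid):
    """Delete one user's sensitive attributes from the dimension table."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM dim_user WHERE uid = ?", (uid,))


erase_user(conn, 1)

# The sensitive attributes of user 1 are gone.
remaining_emails = [
    row[0] for row in conn.execute("SELECT email FROM dim_user ORDER BY uid")
]

# Aggregates over the fact table do not change.
daily_totals = conn.execute(
    "SELECT dt, SUM(amount) FROM fact_payment GROUP BY dt ORDER BY dt"
).fetchall()

print(remaining_emails)
print(daily_totals)
```

Note that the bare uid values remain in the fact table in this sketch.
Whether the key itself also has to be removed depends on your policy, and
if it does, the affected index partitions would need a refresh as well.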

------------------------
With warm regards
Xiaoxiang Yu


Re: ACID with Hive/Kylin

Posted by ShaoFeng Shi <sh...@apache.org>.

Hi Nam,

As Kylin is used to store aggregated data, there should be no PII in it.
(If you use Kylin to manage person-level data, that is not a good use
case.)

If you do need to delete certain personal data, refreshing the whole index
or some of its partitions is what we can do.
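
To refresh only some partitions rather than the whole index, you first need
to know which partitions contain the user's rows. A minimal sketch,
assuming the fact data is partitioned by a date column and is available as
(uid, dt) pairs; the function name and the row layout are hypothetical, and
in practice this would be a query against the source table. Triggering the
refresh in Kylin itself is not shown.

```python
def partitions_to_refresh(fact_rows, uid):
    """Return the sorted, distinct partition values holding rows for uid.

    fact_rows is an iterable of (uid, dt) pairs, where dt is the partition
    value (here an ISO date string). This layout is an assumption.
    """
    return sorted({dt for row_uid, dt in fact_rows if row_uid == uid})


rows = [
    (1, "2023-12-01"),
    (2, "2023-12-01"),
    (1, "2023-12-03"),
    (2, "2023-12-04"),
]

# Only these partitions need to be rebuilt after user 1's rows are deleted
# from the source table.
affected = partitions_to_refresh(rows, 1)
print(affected)
```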

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC,
Apache Incubator PMC,
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




