Posted to user@beam.apache.org by Aniruddh Sharma <as...@gmail.com> on 2020/03/19 14:19:56 UTC
detach subset of columns from PCollection; do some operations; reattach transformed columns
Hi
I need some advice on how to implement the following use case.
I read a dataset that is over 1 TB in size and has 1000+ columns.
Only 3 of these 1000+ columns contain PII, and I need to call the Google DLP API on them.
I want to select only these 3 columns and submit only them to the DLP API. Once I get the results back from DLP, I want to replace these 3 columns in my original dataset.
I don't have a UUID for each row, so I will not be able to join the original data (1000+ columns) with the transformed data (3 columns).
Any suggestions on how to implement this?
Thanks
Aniruddh
Re: detach subset of columns from PCollection; do some operations; reattach transformed columns
Posted by Luke Cwik <lc...@google.com>.
What is your data source?
Can you add a row identifier or use some combination of columns as a unique
key?
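
To make the two options concrete, here is a minimal plain-Python sketch of the idea, not Beam code and not the DLP client. It either attaches a synthetic row identifier or derives a key from a combination of columns, splits each row into a PII part and a remainder, transforms the PII part, and merges the two sides back on the key. The column names, the helper names, and fake_deidentify (a stand-in for the real DLP call) are all hypothetical. In a Beam pipeline the final merge would presumably be a keyed join such as CoGroupByKey.

```python
import hashlib
import uuid

PII_COLUMNS = ("name", "email", "phone")  # hypothetical column names


def add_row_id(row):
    """Option 1: attach a synthetic unique identifier to a row."""
    return uuid.uuid4().hex, row


def key_from_columns(row, columns):
    """Option 2: derive a key from a combination of columns.

    Only safe if the chosen combination is unique across the dataset.
    """
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def split_row(keyed_row):
    """Split a keyed row into (key, PII columns) and (key, other columns)."""
    key, row = keyed_row
    pii = {c: row[c] for c in PII_COLUMNS}
    rest = {c: v for c, v in row.items() if c not in PII_COLUMNS}
    return (key, pii), (key, rest)


def fake_deidentify(pii):
    """Placeholder for the DLP call: masks every value with asterisks."""
    return {c: "*" * len(str(v)) for c, v in pii.items()}


def reattach(rest_by_key, transformed_by_key):
    """Join both sides on the key and merge them back into full rows."""
    return [{**rest_by_key[k], **transformed_by_key[k]} for k in rest_by_key]


rows = [
    {"name": "Ann", "email": "ann@example.com", "phone": "555-0100", "amount": 10},
    {"name": "Bob", "email": "bob@example.com", "phone": "555-0101", "amount": 20},
]

keyed = [add_row_id(r) for r in rows]
pii_side, rest_side = zip(*(split_row(kr) for kr in keyed))
transformed = {k: fake_deidentify(p) for k, p in pii_side}
result = reattach(dict(rest_side), transformed)
```

Only the small (key, 3 columns) side has to go through the external call. The wide side just carries the key along until the merge. The identifier has to be assigned once and stored with the row before the split. Regenerating a random ID on a retry would break the join, which is one reason a key derived from existing columns can be attractive when a unique combination exists.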