Posted to user@beam.apache.org by Aniruddh Sharma <as...@gmail.com> on 2020/03/19 14:19:56 UTC

detach subset of columns from PCollection; do some operations; reattach transformed columns

Hi 

I need some advice on how to implement the following use case.

I read a dataset that is 1+ TB in size and has 1000+ columns.

Only 3 of these 1000+ columns contain PII, and I need to call the Google DLP API on them.

I want to select only these 3 columns and submit just them to the DLP API. Once I get the results back from DLP, I want to replace these 3 columns in my original dataset with the transformed values.

I don't have a UUID for each row, so I cannot join the original data (1000+ columns) with the other data (3 columns).

Any suggestions on how to implement this?

Thanks
Aniruddh

Re: detach subset of columns from PCollection; do some operations; reattach transformed columns

Posted by Luke Cwik <lc...@google.com>.
What is your data source?

Can you add a row identifier or use some combination of columns as a unique
key?
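
To make the two options above concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the column names, the helper functions, and `fake_deidentify` (a stand-in for the real DLP call) are all hypothetical and not taken from the original pipeline. It shows a generated row identifier, a key derived from a combination of columns, and how either key lets the 3 transformed columns be reattached to the wide row.

```python
import hashlib
import uuid

PII_COLUMNS = ("name", "email", "phone")  # hypothetical column names


def with_random_id(row):
    """Option 1: attach a freshly generated identifier to the row."""
    return str(uuid.uuid4()), row


def with_composite_key(row, key_columns):
    """Option 2: derive a key by hashing a combination of columns.

    Only safe if the chosen columns are unique per row.
    """
    raw = "\x1f".join(str(row[c]) for c in key_columns)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest(), row


def detach(keyed_row):
    """Split (key, row) into a narrow PII record and the remaining columns."""
    key, row = keyed_row
    pii = {c: row[c] for c in PII_COLUMNS}
    rest = {c: v for c, v in row.items() if c not in PII_COLUMNS}
    return (key, pii), (key, rest)


def fake_deidentify(pii):
    """Stand-in for the DLP call; it only masks values for illustration."""
    return {c: "***" for c in pii}


def reattach(rest, transformed_pii):
    """Merge the transformed PII columns back into the wide row."""
    return {**rest, **transformed_pii}


rows = [
    {"name": "Ann", "email": "ann@example.com", "phone": "555-0100", "score": 7},
    {"name": "Bob", "email": "bob@example.com", "phone": "555-0101", "score": 9},
]

# Key each row once, then derive both pieces from the same keyed element
# so that the narrow and wide records share the same key.
keyed = [with_random_id(r) for r in rows]
narrow, wide = zip(*(detach(kr) for kr in keyed))

# Transform only the narrow records, then join back on the key.
transformed = {key: fake_deidentify(pii) for key, pii in narrow}
result = [reattach(rest, transformed[key]) for key, rest in wide]
```

In a Beam Python pipeline, functions like these would typically be applied with `beam.Map`, and the join on the key could be done with `CoGroupByKey`. The dictionary lookup here only simulates that join on a tiny in-memory list.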
