You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2019/03/16 17:39:07 UTC

Masking username in Spark with regexp_replace and reverse functions

Hi,

I am looking at Description column of a bank statement (CSV download) that
has the following format

scala> account_table.printSchema
root
 |-- TransactionDate: date (nullable = true)
 |-- TransactionType: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Value: double (nullable = true)
 |-- Balance: double (nullable = true)
 |-- AccountName: string (nullable = true)
 |-- AccountNumber: string (nullable = true)

The column description for BACS payments contains the name of the
individual who paid into the third party account. I need to mask the name
but cannot simply use a literal as below for all contents of descriptions
column!

f1.withColumn("Description", lit("*** Masked
***")).select('Description.as("Who paid")

So I try the following combination

f1.select(trim(substring(substring_index('Description, ",",
1),2,50)).as("name in clear"),
reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description,
"^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
+------------------+------+
|          in clear|Masked|
+------------------+------+
|       FATAH SABAH|HXTXXX|
|       C HIGGINSON|GIHXXX|
|           SOLTA A|XTLOSX|
+------------------+------+

This seems to work as it not only masks the name but also makes it
consistent for all names (in other words, the same username gets the same
mask).

Are there any better alternatives?

Thanks

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: Masking username in Spark with regexp_replace and reverse functions

Posted by JB Data31 <jb...@gmail.com>.

Hi,

Why don't add a random regexp in regexp substitution, i.e.
https://onlinerandomtools.com/generate-random-data-from-regexp

@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>



Le sam. 16 mars 2019 à 18:39, Mich Talebzadeh <mi...@gmail.com> a
écrit :

> Hi,
>
> I am looking at Description column of a bank statement (CSV download) that
> has the following format
>
> scala> account_table.printSchema
> root
>  |-- TransactionDate: date (nullable = true)
>  |-- TransactionType: string (nullable = true)
>  |-- Description: string (nullable = true)
>  |-- Value: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- AccountName: string (nullable = true)
>  |-- AccountNumber: string (nullable = true)
>
> The column description for BACS payments contains the name of the
> individual who paid into the third party account. I need to mask the name
> but cannot simply use a literal as below for all contents of descriptions
> column!
>
> f1.withColumn("Description", lit("*** Masked
> ***")).select('Description.as("Who paid")
>
> So I try the following combination
>
> f1.select(trim(substring(substring_index('Description, ",",
> 1),2,50)).as("name in clear"),
> reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description,
> "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
> +------------------+------+
> |          in clear|Masked|
> +------------------+------+
> |       FATAH SABAH|HXTXXX|
> |       C HIGGINSON|GIHXXX|
> |           SOLTA A|XTLOSX|
> +------------------+------+
>
> This seems to work as it not only masks the name but also makes it
> consistent for all names (in other words, the same username gets the same
> mask).
>
> Are there any better alternatives?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>

Re: Masking username in Spark with regexp_replace and reverse functions

Posted by Mich Talebzadeh <mi...@gmail.com>.

Thanks guys.

All the analysis on windowing functions are done using the authentic names.
I only randomize names for the reporting purposes. So the figures tend to
be correct.

I agree with you Jorn that masking one name is not enough and one can
identify the row through transaction dates and the amount paid. Also most
tools these days tokenize the name and account numbers not realising that
certain information like mobile numbers are unique IDs.

In this case it is just a case study but for real world it will require
professional tools and approach.

Thanks again.

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 17 Mar 2019 at 09:50, Jörn Franke <jo...@gmail.com> wrote:

> For the approach below you have to check for collisions, ie different name
> lead to same masked value.
>
> You could hash it. However in order to avoid that one can just try
> different hashes you need to include in each name a different random
> factor.
>
> However, the anonymization problem is bigger, because based on other
> fields and correlation, the individual might still be identifiable.
>
> Am 16.03.2019 um 18:39 schrieb Mich Talebzadeh <mich.talebzadeh@gmail.com
> >:
>
> Hi,
>
> I am looking at Description column of a bank statement (CSV download) that
> has the following format
>
> scala> account_table.printSchema
> root
>  |-- TransactionDate: date (nullable = true)
>  |-- TransactionType: string (nullable = true)
>  |-- Description: string (nullable = true)
>  |-- Value: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- AccountName: string (nullable = true)
>  |-- AccountNumber: string (nullable = true)
>
> The column description for BACS payments contains the name of the
> individual who paid into the third party account. I need to mask the name
> but cannot simply use a literal as below for all contents of descriptions
> column!
>
> f1.withColumn("Description", lit("*** Masked
> ***")).select('Description.as("Who paid")
>
> So I try the following combination
>
> f1.select(trim(substring(substring_index('Description, ",",
> 1),2,50)).as("name in clear"),
> reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description,
> "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
> +------------------+------+
> |          in clear|Masked|
> +------------------+------+
> |       FATAH SABAH|HXTXXX|
> |       C HIGGINSON|GIHXXX|
> |           SOLTA A|XTLOSX|
> +------------------+------+
>
> This seems to work as it not only masks the name but also makes it
> consistent for all names (in other words, the same username gets the same
> mask).
>
> Are there any better alternatives?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>

Re: Masking username in Spark with regexp_replace and reverse functions

Posted by Jörn Franke <jo...@gmail.com>.

For the approach below you have to check for collisions, ie different name lead to same masked value.

You could hash it. However in order to avoid that one can just try different hashes you need to include in each name a different random factor. 

However, the anonymization problem is bigger, because based on other fields and correlation, the individual might still be identifiable.

> Am 16.03.2019 um 18:39 schrieb Mich Talebzadeh <mi...@gmail.com>:
> 
> Hi,
> 
> I am looking at Description column of a bank statement (CSV download) that has the following format
> 
> scala> account_table.printSchema
> root
>  |-- TransactionDate: date (nullable = true)
>  |-- TransactionType: string (nullable = true)
>  |-- Description: string (nullable = true)
>  |-- Value: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- AccountName: string (nullable = true)
>  |-- AccountNumber: string (nullable = true)
> 
> The column description for BACS payments contains the name of the individual who paid into the third party account. I need to mask the name but cannot simply use a literal as below for all contents of descriptions column!
> 
> f1.withColumn("Description", lit("*** Masked ***")).select('Description.as("Who paid")
> 
> So I try the following combination
> 
> f1.select(trim(substring(substring_index('Description, ",", 1),2,50)).as("name in clear"),
> reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description, "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
> +------------------+------+
> |          in clear|Masked|
> +------------------+------+
> |       FATAH SABAH|HXTXXX|
> |       C HIGGINSON|GIHXXX|
> |           SOLTA A|XTLOSX|
> +------------------+------+
> 
> This seems to work as it not only masks the name but also makes it consistent for all names (in other words, the same username gets the same mask).
> 
> Are there any better alternatives?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>