Posted to solr-user@lucene.apache.org by "Tarala, Magesh" <MT...@bh.com> on 2015/10/08 02:09:06 UTC

Scramble data

Folks,
I have a strange question. We have a Solr implementation that we would like to demo to external customers. But we don't want to display the real data, which contains our customer information and so is sensitive. What's the best way to scramble the data in the Solr query results? By best I mean the simplest way, with the least amount of work. BTW, we have a .NET front-end application.

Thanks,
Magesh




Re: Scramble data

Posted by Roman Chyla <ro...@gmail.com>.
Or you could also apply an XSLT stylesheet to the returned records:
https://wiki.apache.org/solr/XsltResponseWriter



Re: Scramble data

Posted by Uwe Reh <re...@hebis.uni-frankfurt.de>.
Hi,

My suggestions are probably too simple, because they offer no real
protection of privacy. But maybe one of them fits your needs.

Most simple:
Declare your 'hidden' fields as indexed="true" stored="false". The data
will still be used for searching, but the fields are not returned in the
query response.
Cons: Advanced users can still examine the terms of those fields, for
example by faceting on the field.
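
To make the drawback above concrete, here is a minimal Python sketch of the
kind of request an advanced user could send. The core URL and the field name
"customer_name" are assumptions for illustration only. The point is that
faceting reads terms from the index, so stored="false" does not hide them.

```python
from urllib.parse import urlencode

# Hypothetical values: adjust the core URL and field name to your own setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/demo/select"
HIDDEN_FIELD = "customer_name"


def facet_probe_url(field: str) -> str:
    """Build a request that returns no documents, only the indexed terms of `field`."""
    params = {
        "q": "*:*",
        "rows": 0,             # no documents are needed for the probe
        "facet": "true",
        "facet.field": field,  # facet values come from indexed terms, not stored values
        "facet.limit": 100,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(facet_probe_url(HIDDEN_FIELD))
```

If a demo user can send arbitrary parameters to Solr, a request like this
would list the indexed terms of the field even though the field never shows
up in the documents.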

Very simple:
Use a PhoneticFilterFactory for indexing and searching. The
"ColognePhonetic" encoder generates a numeric code for each term, so the
name "Breschnew" is stored as "17863".
Cons: Phonetic similarities will lead to false hits. This encoding only
scrambles the terms and is not appropriate as a security feature.

Simple:
Declare a dedicated SearchHandler in your solrconfig.xml and define an
invariant field list (fl) parameter that contains just the public
subset of your fields.
Cons: I'm not really sure about this one.

Still quite simple:
Write your own filter that generates real cryptographic hashes.
Cons: If the entropy of your data is low, you may need additional
tricks like padding the data. This filter may also slow down your system.
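
As a rough illustration of the hashing step only, here is a Python sketch. It
is not a Solr filter (a real one would be written in Java and applied at both
index and query time), and it swaps the "padding" idea for a keyed hash
(HMAC-SHA256), which serves the same purpose: without the secret, an outsider
cannot simply hash a list of likely names and compare. The key value and the
16-character truncation are assumptions made for the example.

```python
import hashlib
import hmac

# Assumption: a secret kept outside the index. This value is only a placeholder.
SECRET_KEY = b"replace-with-a-long-random-secret"


def scramble_term(term: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash of a term, usable as an opaque replacement token."""
    # Normalize so that the same term hashes identically at index and query time.
    normalized = term.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated to 16 hex characters (64 bits) to keep tokens short.
    return digest[:16]


print(scramble_term("Breschnew"))
print(scramble_term("breschnew") == scramble_term("Breschnew"))
```

Because the hash is exact, only identical (normalized) terms match. Unlike
the phonetic approach there are no fuzzy hits, and prefix or wildcard
searches on the hashed field will not work.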


Last but not least, be aware that searching itself can be a way to
recover hidden information. If a query for "billionaire" gets just one
hit, it's obvious that "billionaire" is an attribute of that document
even if it is not listed in the result.

Uwe

Re: Scramble data

Posted by Susheel Kumar <su...@gmail.com>.
Like Erick said, would something like applying a replace function to the
individual sensitive fields in the fl param work? Replacing the values
with something like REDACTED, etc.


RE: Scramble data

Posted by "Tarala, Magesh" <MT...@bh.com>.
I already have the data ingested and it takes several days to do that. I was trying to avoid re-ingesting the data. 

Thanks,
Magesh


Re: Scramble data

Posted by Erick Erickson <er...@gmail.com>.
Probably sanitize the data on the front end? Something simple, like putting
"REDACTED" in all of the customer-sensitive fields.
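
A minimal sketch of that front-end redaction, written in Python for brevity
(the actual front end in this thread is .NET, where the same loop applies).
It assumes the standard Solr JSON response layout with the documents under
response.docs. The field names are hypothetical.

```python
import copy

# Hypothetical field names: list whichever fields hold customer data.
SENSITIVE_FIELDS = {"customer_name", "email", "phone"}


def redact_response(response: dict, fields=SENSITIVE_FIELDS, placeholder="REDACTED") -> dict:
    """Return a copy of a Solr JSON response with sensitive field values replaced."""
    redacted = copy.deepcopy(response)  # leave the caller's object untouched
    for doc in redacted.get("response", {}).get("docs", []):
        for name in fields:
            if name in doc:
                doc[name] = placeholder
    return redacted


sample = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"id": "1", "customer_name": "Jane Roe", "city": "Springfield"}],
    },
}
print(redact_response(sample)["response"]["docs"])
```

This needs no re-indexing. Note that it only covers the document list: if
the front end also displays highlighting snippets or facet values, those
sections of the response would need the same treatment.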

You might also write a DocTransformer plugin. All you have to do is
subclass DocTransformer and override one very simple "transform" method.

Best,
Erick


Re: Scramble data

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
Can you just generate a fake data set for testing? There are numerous
libraries that create fake names, phone numbers, etc. that you can use to
create mock data. Faker is one we have used in sensitive situations:

https://github.com/joke2k/faker

I think this is going to be a better long-term solution than trying to play
around with possibly sensitive info.
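
To give a feel for how little code a mock data set needs, here is a
dependency-free Python sketch. It deliberately uses only the standard
library. Faker would produce far more varied and realistic values. The
field names and the name lists are made up for the example.

```python
import json
import random

# Made-up sample values: a library such as Faker would supply much richer data.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
LAST_NAMES = ["Smith", "Lee", "Garcia", "Miller", "Novak"]


def fake_customer(rng: random.Random, doc_id: int) -> dict:
    """Build one mock document. The field names are hypothetical."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "id": str(doc_id),
        "customer_name": f"{first} {last}",
        "email": f"{first}.{last}@example.com".lower(),  # reserved example domain
        "phone": f"555-01{rng.randint(0, 99):02d}",      # fictional 555-01xx range
    }


def fake_dataset(count: int, seed: int = 42) -> list:
    """Generate `count` mock documents. A fixed seed makes the demo repeatable."""
    rng = random.Random(seed)
    return [fake_customer(rng, i) for i in range(count)]


# A JSON array of documents like this can be sent to a separate demo core.
print(json.dumps(fake_dataset(3), indent=2))
```

Indexing mock documents into a separate demo core keeps the real index out
of the demo entirely, at the cost of one (much smaller) ingest.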

-Doug


-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>