Posted to java-user@lucene.apache.org by Walt Stoneburner <wa...@gmail.com> on 2007/03/08 16:28:59 UTC

Index a source, but not store it... can it be done?

Have an interesting scenario I'd like to get your take on with respect
to Lucene:

A data provider (e.g. someone with a private website or corporately
shared directory of proprietary documents) has requested their content
be indexed with Lucene so employees can be redirected to it, but
provisionally -- under no circumstance should that content be stored
or recreated from the index.

Is that even possible?

The data owner's request makes sense in the context of them wanting to
retain full access control via logins as well as collecting access
metrics.

If the token 'CAT' points to C:\Corporate\animals.doc and the token
'DOG' also points there, then great, CAT AND DOG will give that
document a higher rating, though it is not possible to reconstruct
(with any great accuracy) what the actual document content is.

However, if for the sake of using the NEAR operator with Lucene the
tokens are stored as  LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
THIS:8 DECEMBER:9 ... then someone could pull all tokens for
animals.doc and reconstitute the token stream.

Does Lucene have any kind of trade off for working with "secure" (and
I use this term loosely) data?

-wls



Re: Index a source, but not store it... can it be done?

Posted by Jason Pump <ja...@healthline.com>.
Agreed, it's totally hackable, particularly with an MD5 hash. If you 
used a 16-bit hash, e.g. mod % 65536, then it becomes more difficult to 
reconstruct the original document but less precise in querying. It might 
be nice to store the individual words contained in each document as just 
a sorted list, perhaps with a count, and then, for queries of more than 
one word, restrict to documents which contain those words and do 
proximity matching against the 16-bit or even 8-bit hashes. With a 
two-word query and an 8-bit hash you should only get collisions about 
1 in 2^16; with a 16-bit hash, only about 1 in 2^32. So your two-word 
query would be something like 
'+words:word1 +words:word2 +proximityField:"237 116"~30'.
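
For illustration, here is a rough sketch of the query side of that scheme
in Java. The field names "words" and "proximityField" come from the message
above; the hash16 helper and the use of String.hashCode as the bucketing
function are my own stand-ins (the message suggests MD5 mod 65536 instead):

    // Sketch: build the two-field query described above. The index side
    // would need a matching TokenFilter that writes each term's 16-bit
    // bucket into "proximityField" while the plain terms go into "words".
    static int hash16(String term) {
        // any stable hash works; masking String.hashCode to 16 bits is
        // just the simplest stand-in for "MD5 mod 65536"
        return term.hashCode() & 0xFFFF;
    }

    static String twoWordQuery(String w1, String w2) {
        return "+words:" + w1 + " +words:" + w2
             + " +proximityField:\"" + hash16(w1) + " " + hash16(w2) + "\"~30";
    }

    // twoWordQuery("cat", "dog") then yields something of the form
    // +words:cat +words:dog +proximityField:"32726 34108"~30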

John Haxby wrote:
> Chris Hostetter wrote:
>> i'm no crypto expert, but i imagine it would probably take the same
>> amount of statistical guesswork to reconstruct meaningful info from
>> either approach (hashing the individual words compared to eliminating the
>> positions) so i would think the trade-off of supporting phrase queries
>> would make the hashing approach more worthwhile.
>>   
> Hashing the words sounds like a brilliant idea, but it's subject to a 
> known plaintext attack. Basically if I have the index and a document 
> in it I can compare the reconstructed document consisting of hashed 
> words and the original document. I can then say that "cat" is 
> "d077f244def8a70e5ea758bd8352fcd8", dog is 
> "06d80eb0c50b49a509b49f2424e8c805" and so on. I then have a choice of 
> attacking the hash algorithm (there are no prizes for guessing what I 
> used) or simply constructing a table of known words and hashes.
>
> Injecting my own documents into the corpus would let me fill in 
> missing words or test a hypothesis about the hash algorithm. It would 
> be fun to see how long it would take to reconstruct the index with no 
> hashed words in it :-) The best, really, that can be said about 
> hashing is that it either piques the interest of someone who would 
> otherwise ignore it or discourages the most casual of hackers.
>
> Setting all the token positions to zero helps quite a lot because you 
> then can't distinguish between "the cat and dog sat on the mat" and 
> "the cat sat on the dog on the mat" (more or less) and one is much 
> more interesting than the other. Of course, simply knowing that there 
> are documents that contain the words, oh, "airport" and "bomb" would 
> be interesting. What would be even more interesting would be comparing 
> the index at different times -- like knowing that the Pentagon ordered 
> many more pizzas the day before the air strikes in the Gulf, that kind 
> of thing. (Langley, incidentally, ordered the same number they always 
> ordered, but perhaps fewer were put in the trash unconsumed.) (This 
> might also be an urban myth, I haven't ever seen anything definitive 
> about it.) (But I digress.)
>
> It all depends on how much value the owner of the original documents 
> places on them and how much effort he thinks that a hacker might be 
> prepared to put into recovering the text.
>
> The best you're ever going to do is to protect the index as well as 
> you do the original documents.
>
> jch
>


-- 
Jason Pump
Technical Architect
Healthline
660 Third Street, Ste. 100
San Francisco, CA 94107
direct dial 415.281.3133
cell 510.812.1784
www.healthline.com 




Re: Index a source, but not store it... can it be done?

Posted by John Haxby <jc...@scalix.com>.
Chris Hostetter wrote:
> i'm no crypto expert, but i imagine it would probably take the same
> amount of statistical guesswork to reconstruct meaningful info from
> either approach (hashing the individual words compared to eliminating the
> positions) so i would think the trade-off of supporting phrase queries
> would make the hashing approach more worthwhile.
>   
Hashing the words sounds like a brilliant idea, but it's subject to a 
known plaintext attack. Basically if I have the index and a document in 
it I can compare the reconstructed document consisting of hashed words 
and the original document. I can then say that "cat" is 
"d077f244def8a70e5ea758bd8352fcd8", dog is 
"06d80eb0c50b49a509b49f2424e8c805" and so on. I then have a choice of 
attacking the hash algorithm (there are no prizes for guessing what I 
used) or simply constructing a table of known words and hashes.
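
A minimal sketch of that lookup-table attack, assuming the obfuscation
really is a plain MD5 of each word (the method name and the source of the
candidate word list are hypothetical):

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Hash every candidate word with the guessed algorithm and invert the
    // mapping; obfuscated terms pulled from the index can then be looked up.
    static Map<String, String> buildReverseTable(Iterable<String> candidateWords)
            throws Exception {
        Map<String, String> hashToWord = new HashMap<String, String>();
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (String word : candidateWords) {
            byte[] digest = md5.digest(word.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            hashToWord.put(hex.toString(), word);   // e.g. "d077f2..." -> "cat"
        }
        return hashToWord;
    }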

Injecting my own documents into the corpus would let me fill in missing 
words or test a hypothesis about the hash algorithm. It would be fun to 
see how long it would take to reconstruct the index with no hashed words 
in it :-) The best, really, that can be said about hashing is that it 
either piques the interest of someone who would otherwise ignore it or 
discourages the most casual of hackers.

Setting all the token positions to zero helps quite a lot because you 
then can't distinguish between "the cat and dog sat on the mat" and "the 
cat sat on the dog on the mat" (more or less) and one is much more 
interesting than the other. Of course, simply knowing that there are 
documents that contain the words, oh, "airport" and "bomb" would be 
interesting. What would be even more interesting would be comparing the 
index at different times -- like knowing that the Pentagon ordered many 
more pizzas the day before the air strikes in the Gulf, that kind of 
thing. (Langley, incidentally, ordered the same number they always 
ordered, but perhaps fewer were put in the trash unconsumed.) (This 
might also be an urban myth, I haven't ever seen anything definitive 
about it.) (But I digress.)

It all depends on how much value the owner of the original documents 
places on them and how much effort he thinks that a hacker might be 
prepared to put into recovering the text.

The best you're ever going to do is to protect the index as well as you 
do the original documents.

jch



Re: Index a source, but not store it... can it be done?

Posted by Mike Klaas <mi...@gmail.com>.
On 3/8/07, Chris Hostetter <ho...@fucit.org> wrote:

> if the issue is that you want to be able to ship an index that people can
> manipulate as much as they want and you want to guarantee they can never
> reconstruct the original docs, you're pretty much screwed ... even if you
> eliminate all of the position info, statistical info about language
> structure can help you glean a lot about the source data.

True.

> i'm no crypto expert, but i imagine it would probably take the same
> amount of statistical guesswork to reconstruct meaningful info from
> either approach (hashing the individual words compared to eliminating the
> positions) so i would think the trade-off of supporting phrase queries
> would make the hashing approach more worthwhile.

I suppose it also depends on how much access the user has to the
index.  If they have access to the physical index and means of
querying it, then they have access to the hashing algo (and/or key)
and so it is worthless.  If they don't, and their access is strictly
through queries, then I don't see what help hashing will provide, as
the result of any given query should be the same, hashing or not.

> i mean afterall: you still wnat the index to be useful for searching
> right? ... if you are really paranoid don't just strip the positions,
> strip all duplicate terms as well to prevent any attempt at statistical
> sampling ... but now all you relaly have is a lookup table of word to
> docid with no tf/idf or position info to improve scoring, so why bother
> with Lucene, jsut use a BerkleyDB file to do your lookups.

You could also do both.  Another thing that might help is relatively
aggressive stop word removal.  All these measures will raise the
"discouragement" bar slightly.

-Mike



Re: Index a source, but not store it... can it be done?

Posted by Doron Cohen <DO...@il.ibm.com>.
> i mean after all: you still want the index to be useful for searching
> right? ... if you are really paranoid don't just strip the positions,
> strip all duplicate terms as well to prevent any attempt at statistical
> sampling ... but now all you really have is a lookup table of word to

That's right, once position info is discarded, phrase/span search is not
possible. With the token obfuscation approach (obfuscate only after
stemming/normalizing at both indexing and search time) phrase/span queries
work, but not wildcard queries. To me, phrase/span queries seem more
important than wildcard queries, but this really depends on the application
in question. Security-wise, I don't think either solution would be
considered safe by any security expert.

> docid with no tf/idf or position info to improve scoring, so why bother
> with Lucene, just use a BerkeleyDB file to do your lookups.

With tf info in place, Lucene search quality would be far beyond that of a
DB lookup. In fact, search quality is preserved, right? (except that
phrase/span queries don't work)







Re: Index a source, but not store it... can it be done?

Posted by Chris Hostetter <ho...@fucit.org>.
: I don't know... hashing individual words is an extremely weak form of
: security that should be breakable without even using a computer... all
: the statistical information is still there (somewhat like 'encrypting'
: a message as a cryptoquote).
:
: Doron's suggestion is preferable: eliminate token position information
: from the index entirely.

i guess i wasn't thinking about this as a "security" issue, more a
"discouragement" issue ... reconstructing a doc from term vectors is easy,
reconstructing it from just term positions is harder but not impossible,
reconstructing from hashed tokens requires a lot of hard work.

if the issue is that you want to be able to ship an index that people can
manipulate as much as they want and you want to guarantee they can never
reconstruct the original docs, you're pretty much screwed ... even if you
eliminate all of the position info, statistical info about language
structure can help you glean a lot about the source data.

i'm no crypto expert, but i imagine it would probably take the same
amount of statistical guesswork to reconstruct meaningful info from
either approach (hashing the individual words compared to eliminating the
positions) so i would think the trade-off of supporting phrase queries
would make the hashing approach more worthwhile.

i mean after all: you still want the index to be useful for searching
right? ... if you are really paranoid don't just strip the positions,
strip all duplicate terms as well to prevent any attempt at statistical
sampling ... but now all you really have is a lookup table of word to
docid with no tf/idf or position info to improve scoring, so why bother
with Lucene, just use a BerkeleyDB file to do your lookups.
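
For what it's worth, stripping duplicate terms could itself be a TokenFilter
in the same analysis chain. A minimal sketch against the Lucene 2.x-era
TokenStream API (the class name is made up), relying on the analyzer handing
out a fresh filter instance per field so the seen-set resets per document:

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class UniqueTermFilter extends TokenFilter {
        private final Set<String> seen = new HashSet<String>();

        public UniqueTermFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t;
            while ((t = input.next()) != null) {
                if (seen.add(t.termText())) {
                    return t;   // first occurrence only; repeats are dropped
                }
            }
            return null;
        }
    }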


-Hoss




Re: Index a source, but not store it... can it be done?

Posted by Mike Klaas <mi...@gmail.com>.
On 3/8/07, Chris Hostetter <ho...@fucit.org> wrote:
> : If you store a hash code of the word rather than the actual word you
> : should be able to search for stuff but not be able to actually retrieve
>
> that's a really great solution ... it could even be implemented as a
> TokenFilter so none of your client code would ever even need to know that
> it was being used (just make sure it comes last after any stemming or what
> not)

I don't know... hashing individual words is an extremely weak form of
security that should be breakable without even using a computer... all
the statistical information is still there (somewhat like 'encrypting'
a message as a cryptoquote).

Doron's suggestion is preferable: eliminate token position information
from the index entirely.

-Mike



Re: Index a source, but not store it... can it be done?

Posted by Chris Hostetter <ho...@fucit.org>.
: If you store a hash code of the word rather than the actual word you
: should be able to search for stuff but not be able to actually retrieve

that's a really great solution ... it could even be implemented as a
TokenFilter so none of your client code would ever even need to know that
it was being used (just make sure it comes last after any stemming or what
not)
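
A sketch of that wiring with the Lucene 2.x-era Analyzer API. Hash64Filter
stands in for whatever hashing filter is used (for instance the one sketched
under Jason Pump's message further down), and the rest of the chain is just
one plausible choice:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class HashingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardTokenizer(reader);
            ts = new LowerCaseFilter(ts);
            ts = new PorterStemFilter(ts);   // stem/normalize first ...
            return new Hash64Filter(ts);     // ... hash last, so query terms hash identically
        }
    }

Using the same analyzer at both index and search time means query terms get
exactly the same stemming and hashing treatment as indexed terms, which is
what keeps the client code oblivious to the obfuscation.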




-Hoss




Re: Index a source, but not store it... can it be done?

Posted by Jason Pump <jp...@mindspring.com>.
If you store a hash code of the word rather than the actual word you 
should be able to search for stuff but not be able to actually retrieve 
it; you can trade precision for "security" based on the number of bits 
in the hash code (e.g. 32 or 64 bits). I'd think a 64-bit hash would be 
a reasonable midpoint.

hash64("dog") = 4312311231123121;

"body:4312311231123121" returns document with dog, but also any other 
document with a word that hashes to the same value.
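
A minimal sketch of that as a Lucene 2.x-era TokenFilter. The class name and
the choice of FNV-1a as the 64-bit hash are my own; the message doesn't pick
a particular algorithm:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class Hash64Filter extends TokenFilter {
        public Hash64Filter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            // replace the term text with its 64-bit hash; keep offsets and
            // the position increment so phrase/NEAR queries still line up
            Token hashed = new Token(Long.toString(hash64(t.termText())),
                                     t.startOffset(), t.endOffset());
            hashed.setPositionIncrement(t.getPositionIncrement());
            return hashed;
        }

        // FNV-1a, 64 bits; any stable hash with enough bits would do
        static long hash64(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }
    }

Running the same filter in the query-time analyzer hashes the query term
identically, so searching for dog becomes the body:<hash> query shown above
without the caller having to compute it by hand.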


Walt Stoneburner wrote:
> Have an interesting scenario I'd like to get your take on with respect
> to Lucene:
>
> A data provider (e.g. someone with a private website or corporately
> shared directory of proprietary documents) has requested their content
> be indexed with Lucene so employees can be redirected to it, but
> provisionally -- under no circumstance should that content be stored
> or recreated from the index.
>
> Is that even possible?
>
> The data owner's request makes sense in the context of them wanting to
> retain full access control via logins as well as collecting access
> metrics.
>
> If the token 'CAT' points to C:\Corporate\animals.doc and the token
> 'DOG' also points there, then great, CAT AND DOG will give that
> document a higher rating, though it is not possible to reconstruct
> (with any great accuracy) what the actual document content is.
>
> However, if for the sake of using the NEAR operator with Lucene the
> tokens are stored as  LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
> THIS:8 DECEMBER:9 ... then someone could pull all tokens for
> animals.doc and reconstitute the token stream.
>
> Does Lucene have any kind of trade off for working with "secure" (and
> I use this term loosely) data?
>
> -wls
>




Re: Index a source, but not store it... can it be done?

Posted by Doron Cohen <DO...@il.ibm.com>.
Token positions are also used for phrase search.

You could probably get a reasonable compromise by setting all token positions
to 0 - this would make a document appear as a *set* of words (rather than a
*list*). An adversary would be able to know/guess what words are in each
document (and, with (API) access to the index itself, how many times each
word appears in each document), but would not be able to reconstruct a
"good" approximation of that document, because term positions are all 0. If
this is sufficient, I think you can do it by writing your own Analyzer with
a TokenFilter that takes care of the position - see
Token.setPositionIncrement().
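
A minimal sketch of such a filter against the Lucene 2.x-era API (the class
name is hypothetical). The first token keeps increment 1 and every later
token gets increment 0, so all terms end up at position 0:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class FlattenPositionsFilter extends TokenFilter {
        private boolean first = true;

        public FlattenPositionsFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            // increment 0 stacks this token on the previous one, so the index
            // records which terms occur (and how often), but not their order
            t.setPositionIncrement(first ? 1 : 0);
            first = false;
            return t;
        }
    }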

Hope this helps,
Doron

"Walt Stoneburner" <wa...@gmail.com> wrote on 08/03/2007
07:28:59:

> Have an interesting scenario I'd like to get your take on with respect
> to Lucene:
>
> A data provider (e.g. someone with a private website or corporately
> shared directory of proprietary documents) has requested their content
> be indexed with Lucene so employees can be redirected to it, but
> provisionally -- under no circumstance should that content be stored
> or recreated from the index.
>
> Is that even possible?
>
> The data owner's request makes sense in the context of them wanting to
> retain full access control via logins as well as collecting access
> metrics.
>
> If the token 'CAT' points to C:\Corporate\animals.doc and the token
> 'DOG' also points there, then great, CAT AND DOG will give that
> document a higher rating, though it is not possible to reconstruct
> (with any great accuracy) what the actual document content is.
>
> However, if for the sake of using the NEAR operator with Lucene the
> tokens are stored as  LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
> THIS:8 DECEMBER:9 ... then someone could pull all tokens for
> animals.doc and reconstitute the token stream.
>
> Does Lucene have any kind of trade off for working with "secure" (and
> I use this term loosely) data?
>
> -wls

