Posted to dev@pirk.apache.org by James Carr <jr...@gmail.com> on 2016/09/22 15:23:33 UTC

Idea for handling result matrix overflows

Hello all,

  I've been working through the Wideskies paper and Walter's slides. I
think I have an idea that would improve usability by a lot, but I want to
run it past you guys to see if it would work.

Here's my understanding of how the Wideskies Responder works (with a rough
code sketch after the list):
- It maintains the values Y_0, ..., Y_(r-1), which hold the encrypted data
returned by the algorithm. From a Comp Sci perspective, I see the
collection of these values functioning as, basically, a big encrypted byte
buffer holding our results. It also maintains a list of counters c_0, ...,
c_t, one for each possible value of the hash function H.
- When a pair (T, D) comes in, we hash T to get H(T), and break D into
8-bit chunks. We then combine each of those chunks into our Y_i values
using E_H(T), our encrypted value for the hash of T, which is chosen in
such a way that a) if we aren't interested in T, adding this chunk to Y_i
does not change the value it decrypts to (we essentially multiply by
encrypted 1); or b) if we are interested in T, E_H(T) has been chosen in
such a way that adding this chunk to Y_i will not overwrite any byte chunks
for *other* interesting terms that have also been written to Y_i.
- We then increment the counter c_H(T) so that future items that hash to
the same value write to different Y_i and do not overwrite the value we
just wrote.
- If *any* of these counters c_H(T) would exceed r due to some new pair
(T,D), then we can no longer ensure that we won't overwrite data we already
saw when we add the new D to Y_0, ..., Y_(r-1). From a Comp Sci
perspective, this looks like "our buffer is full," since we don't have room
for new data. Therefore, we have to terminate the algorithm and return our
list of Y's as they stand, without processing all of the data. This is true
even if most of the terms that hashed to H(T) were ones we weren't
interested in; in other words, if ANY of the terms are extremely common,
even if we aren't interested in them, we run the risk of being unable to
process all of the data.
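
To check my reading, here's the loop as I picture it (a rough
Paillier-style sketch; the class and all the names are mine, not Pirk's
actual code, and I'm hand-waving the exact exponent encoding):

    import java.math.BigInteger;
    import java.util.Arrays;

    class ResponderSketch {
        final BigInteger nSquared;       // Paillier modulus N^2
        final BigInteger[] queryElement; // E_0, ..., E_t: one per hash value
        final int r;                     // number of result values Y_i
        final int chunksPerElement;      // how many 8-bit chunks per D
        BigInteger[] Y;                  // encrypted "buffer" Y_0 ... Y_(r-1)
        int[] counter;                   // c_0, ..., c_t

        ResponderSketch(BigInteger nSquared, BigInteger[] queryElement,
                        int r, int chunksPerElement) {
            this.nSquared = nSquared;
            this.queryElement = queryElement;
            this.r = r;
            this.chunksPerElement = chunksPerElement;
            this.Y = new BigInteger[r];
            Arrays.fill(this.Y, BigInteger.ONE);          // ciphertext identity
            this.counter = new int[queryElement.length];  // all zeros
        }

        // Fold one (T, D) pair into the result, where h = H(T) and chunks
        // is D broken into 8-bit pieces. Returns false if the "buffer" is
        // full and the algorithm would have to stop.
        boolean absorb(int h, byte[] chunks) {
            int base = counter[h] * chunksPerElement;  // first free Y_i for h
            if (base + chunksPerElement > r) {
                return false;                          // "our buffer is full"
            }
            for (int k = 0; k < chunksPerElement; k++) {
                BigInteger d = BigInteger.valueOf(chunks[k] & 0xFF);
                // Homomorphic update: Y_i <- Y_i * E_H(T)^d (mod N^2).
                // Under Paillier this adds the chunk into the plaintext only
                // when T is a term the querier cares about; otherwise the
                // decrypted value is unchanged.
                Y[base + k] = Y[base + k]
                        .multiply(queryElement[h].modPow(d, nSquared))
                        .mod(nSquared);
            }
            counter[h]++;  // next item with this hash writes to fresh Y_i's
            return true;
        }
    }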

Please let me know if I'm off the mark on any of that :-)

So my idea is as follows: In the programming world, when we are processing
a large amount of data and fill up our current write buffer, we just
output that buffer, make a new one, and continue. Why wouldn't that work
here? In other words, when we encounter a pair (T, D) that would cause one
of the counters c_H(T) to exceed r, why couldn't we just push the current
Y_0, ..., Y_(r-1) entries onto a list, create a new set of entries Y'_0,
..., Y'_(r-1), reset all the counters to 0, and continue processing (T, D)
as normal? Then the algorithm could return all of the buffers Y, Y', Y'',
etc. that it created instead of just Y, ensuring that all of the data gets
processed.
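
Sticking with my sketch from above (again, made-up names, not Pirk code),
the change would be something like this inside the same class:

    // Additional field: completed buffers waiting to be returned.
    java.util.List<BigInteger[]> finished = new java.util.ArrayList<>();

    void absorbWithRotation(int h, byte[] chunks) {
        if (!absorb(h, chunks)) {           // this (T, D) would overflow
            finished.add(Y);                // "output" the full buffer
            Y = new BigInteger[r];          // fresh Y'_0, ..., Y'_(r-1)
            Arrays.fill(Y, BigInteger.ONE);
            Arrays.fill(counter, 0);        // reset every counter
            absorb(h, chunks);              // guaranteed to fit now
        }
    }
    // At the end, return all of `finished` plus the current Y,
    // instead of just Y.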

As far as I can tell, this gives the responder no information about our
query terms beyond what it already had, so I don't think it would weaken
the privacy of the algorithm at all, but please correct me if I'm missing
something. On the other hand, I think it would make Pirk much easier to
use, since a query no longer risks failing when we don't understand the
data it's running over well enough to tune r appropriately. I can also
imagine this leading to some performance improvements, since we could let
Pirk choose r automatically so that a result set fits nicely within a hard
disk page, network frame, etc.

What do you guys think?

Regards,
-Ryan Carr

Re: Idea for handling result matrix overflows

Posted by Ellison Anne Williams <ea...@apache.org>.
This is exactly the way the current streaming implementations work. In the
batch case, as Walter noted, r can be set very large (large enough to
encompass all possible data in a batch case). If you do not know the size
of your data a priori in the batch case (say, when processing in Spark
with data in HDFS), you can just use the queueStream option built into the
Spark Streaming responder to get the buffer-like capability that you
described.
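
Very roughly, something like this (a sketch only; the paths and window
size are made up, and I'm eliding how the responder actually consumes the
DStream):

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.Queue;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class QueueStreamSketch {
        public static void main(String[] args) throws Exception {
            JavaSparkContext sc =
                    new JavaSparkContext("local[2]", "queue-sketch");
            JavaStreamingContext ssc =
                    new JavaStreamingContext(sc, Durations.seconds(30));

            // Carve the batch data sitting in HDFS into one RDD per window.
            Queue<JavaRDD<String>> queue = new LinkedList<>();
            for (String path : Arrays.asList("hdfs:///some/batch/part-0",
                                             "hdfs:///some/batch/part-1")) {
                queue.add(sc.textFile(path));
            }

            // Each queued RDD arrives as its own streaming batch, so each
            // one is processed with its own response buffer -- the same
            // effect as rotating Y when it fills up.
            JavaDStream<String> lines = ssc.queueStream(queue);
            lines.print();  // stand-in output op; the responder goes here

            ssc.start();
            ssc.awaitTermination();
        }
    }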

Thus, I think that between the streaming 'buffer' implementation and the
variable r setting in the batch case, the functionality you describe is
already covered.




Re: Idea for handling result matrix overflows

Posted by Walter Ray-Dulany <ra...@gmail.com>.
I would like to hear Ellison Anne's opinion on this, but I think such a
change has no security implications.

While it would be a definite change in how the algorithm operates, from a
mathematical-security perspective this is no different than picking an r
value that is simply enormous (recall that r is the number of responses
that can be returned per query request period per search term). The value
of r is a constrainer, not a constraint, in this system (i.e. we are forced
to choose delta and b such that delta/b divides r, but we've got no
restrictions on our choice of r).
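
To put numbers on that constraint (purely illustrative): if each data
element is delta = 24 bits and we return b = 8-bit chunks, then delta/b =
3, and any multiple of 3 is a legitimate choice of r; r = 3, r = 300, and
r = 3,000,000 all satisfy it equally well.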
