Posted to dev@accumulo.apache.org by Josh Elser <jo...@gmail.com> on 2016/10/11 19:20:09 UTC

[DISCUSS] Would a visibility histogram on a table be harmful?

Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he 
mentioned was the lack of insight into the distribution of data marked 
with certain visibilities in a table. He presented an example similar to 
this:

Imagine a hypothetical system backed by Accumulo which stores medical 
information. There are three labels in the system: PRIVATE, ANONYMIZED, 
and PUBLIC. PRIVATE data is that which could reasonably be considered to 
identify the individual. ANONYMIZED data is some altered version of the 
attribute that retains some portion of the original value, but is 
missing enough context to not identify the individual (e.g. converting 
the name "Josh Elser" to "J E"). PUBLIC data is for attributes which 
cannot identify the individual.

Doctors would be able to read the PRIVATE data, while researchers could 
only read the ANONYMIZED and PUBLIC data. This leads to a question: how 
much of each kind of data is in the system? Without knowing how much 
data is in the system, how can some application developer (who does not 
have the ability to read all of the PRIVATE data) know that their 
application is returning a reasonably correct amount of data? (There 
are many examples of questions which could be answered from this data alone.)

Concretely, this histogram would look like (50 records with PRIVATE, 50 
with ANONYMIZED, and 20 with PUBLIC; 120 records total):

```
PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20
```

Technically, I think this would actually be relatively simple to 
implement. Inside of each RFile, we could maintain some histogram of the 
visibilities observed in that file. This would allow us to very easily 
report how much data in each table has each visibility label.
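
A rough sketch of the idea, for illustration only: each file keeps a map of visibility label to entry count, and the table-level histogram is the sum of the per-file maps. The class and method names here (VisibilityHistogram, observe, mergeFrom, forTable) are hypothetical and are not an existing Accumulo API. The sketch also ignores deletes and versioning, so real counts could drift until a compaction rewrites the files.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-file visibility counts that can be merged per table. */
final class VisibilityHistogram {
    // Visibility label -> number of entries observed with that label.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Records one entry carrying the given visibility label. */
    void observe(String visibility) {
        counts.merge(visibility, 1L, Long::sum);
    }

    /** Folds another histogram (for example, from another file) into this one. */
    void mergeFrom(VisibilityHistogram other) {
        other.counts.forEach((label, n) -> counts.merge(label, n, Long::sum));
    }

    /** Number of entries seen with the given label, or 0 if none. */
    long count(String visibility) {
        return counts.getOrDefault(visibility, 0L);
    }

    /** Total number of entries across all labels. */
    long total() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Table-level view: the sum of the histograms of every file in the table. */
    static VisibilityHistogram forTable(List<VisibilityHistogram> perFile) {
        VisibilityHistogram table = new VisibilityHistogram();
        perFile.forEach(table::mergeFrom);
        return table;
    }

    @Override
    public String toString() {
        return counts.toString();
    }
}
```

With one file holding 30 PRIVATE and 50 ANONYMIZED entries and another holding 20 PRIVATE and 20 PUBLIC entries, forTable would report the 50/50/20 split from the example above.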

However, would this feature be harmful to one of the core tenets of 
Accumulo? Or, is acknowledging the existence of data in Accumulo with a 
certain visibility acceptable? Would a new permission to use such an API 
to access this information be sufficient to protect the data?

- Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Russ Weeks <rw...@newbrightidea.com>.
> I've always been under the impression that accumulo was not supposed to confirm
> the existence of data that a user did not have permission to read.

OK, that makes sense, I can see the need for that. But if we follow this
path of keeping the summary data structure in the RFile header (footer?)
then it's just a convenience that's available to anybody who can read the
RFile. At that point it seems like it's just a question of who else should
be allowed to read it and how to grant that access. A system permission
makes a lot of sense to me.

-Russ


On Tue, Oct 11, 2016 at 4:33 PM Mike Drob <md...@mdrob.com> wrote:

> I've always been under the impression that accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Sean Busbey <bu...@cloudera.com>.
I think a new permission would cover the concern about leaking
meta-information. Even if only the administrative user could see the
histogram (since they can see all data), that'd be a gain.
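
To make the "new permission" idea concrete, here is a minimal sketch of gating the histogram behind a dedicated permission. Everything in it is an assumption for illustration: SketchPermission, VIEW_VISIBILITY_SUMMARY, and HistogramService are made-up names, not Accumulo's SystemPermission or any existing API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical permissions; VIEW_VISIBILITY_SUMMARY is the proposed new one. */
enum SketchPermission { READ_TABLE, VIEW_VISIBILITY_SUMMARY }

/** Hypothetical service that only reveals the histogram to permitted callers. */
final class HistogramService {
    private final Map<String, Long> histogram;

    HistogramService(Map<String, Long> histogram) {
        // Defensive copy so later changes by the caller are not visible here.
        this.histogram = new TreeMap<>(histogram);
    }

    /** Returns a read-only histogram, or throws if the caller lacks the permission. */
    Map<String, Long> visibilityHistogram(Set<SketchPermission> granted) {
        if (!granted.contains(SketchPermission.VIEW_VISIBILITY_SUMMARY)) {
            // Refuse outright rather than return partial counts, so nothing
            // about labels the caller cannot read is confirmed.
            throw new SecurityException("missing permission: VIEW_VISIBILITY_SUMMARY");
        }
        return Collections.unmodifiableMap(histogram);
    }
}
```

A caller holding only READ_TABLE would get a SecurityException here, which keeps the default behavior the same as today: no confirmation that data with a given label exists.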

-- 
Sean Busbey

On Oct 11, 2016 16:33, "Mike Drob" <md...@mdrob.com> wrote:

> I've always been under the impression that accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Mike Drob <md...@mdrob.com>.
I've always been under the impression that accumulo was not supposed to
confirm the existence of data that a user did not have permission to read.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Hah, funny you mention a custom RFile index. I think Adam Fuchs had 
proposed a similar idea before (probably years ago now) :)

re: the monitor, I was more thinking that it would just be an API call 
to access it. I had not thought about automatically displaying it on the 
monitor (but it is an interesting idea...)

I remember making a ticket a while back to move the RFile header from a 
custom serialized object to a Thrift or Protobuf object, which would make 
such a drift in "schema" dirt-simple to handle. Eventually 
there's a concern about putting too much data in there (probably 
reachable with a large number of visibilities -- implementation detail), 
but that's a related thought :)

Dylan Hutchison wrote:
> Interesting idea.  It begs the question: should we allow any custom index
> at the RFile level?  If RFile indexes were user-extensible, then a
> visibility index would be something any developer could write.  That said,
> we can still include such an index as an example, and if we did it could be
> used by the Accumulo monitor.
>
> The RFile-level sampling followed this path.  I would support further work
> similar to it, though I admit I don't know how difficult a job it entails.
> Bonus points if the index information could be accessed from iterators the
> same way that sampled data can.
>
> I can't speak to the appropriateness of visibility histograms on the
> monitor *by default*, but it would be a strictly useful feature if it could
> be enabled via a conf option.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ct...@apache.org> wrote:
> Keith, Russ, and I (and possibly others) were discussing this at the
> hackathon after the Accumulo Summit, and I think our consensus was
> basically this:
>
> We need a generic pluggable mechanism for injecting arbitrary user counters
> into the RFiles. We can then use these counters in custom compaction
> strategies, or other analysis. We can aggregate these counters at the
> tablet, and table levels, and expose them in the API.
>
> These counters could store information about visibility frequencies, number
> of delete entries, etc.
>
> The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.

One thing I discussed with Russ was following MapReduce's design for
counters in order to avoid object allocation.  Something like the
following would avoid allocating a map to return.

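import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;

// Sink for named counters; reused across entries so no map is allocated per call.
interface Counters {
  void increment(ByteSequence counter, long amount);
}

// Called once per key/value pair; reports counts through the supplied Counters.
interface Summarizer {
  void summarize(Key k, Value v, Counters counters);
}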
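
As an illustration of how the Counters and Summarizer interfaces proposed above might be used for the visibility case, here is a self-contained sketch. To keep it runnable without Accumulo on the classpath it makes several assumptions: String stands in for ByteSequence, byte[] for Value, and SimpleKey is a made-up stand-in for Key carrying only a row and a visibility. The Sketch-prefixed interfaces mirror the proposal but are not real Accumulo types.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for Accumulo's Key with only the fields this sketch needs (an assumption). */
final class SimpleKey {
    final String row;
    final String visibility;

    SimpleKey(String row, String visibility) {
        this.row = row;
        this.visibility = visibility;
    }
}

/** Counter sink modeled on the proposed Counters interface (String instead of ByteSequence). */
interface SketchCounters {
    void increment(String counter, long amount);
}

/** Per-entry callback modeled on the proposed Summarizer interface. */
interface SketchSummarizer {
    void summarize(SimpleKey key, byte[] value, SketchCounters counters);
}

/** Counts entries per visibility label; returns nothing, so no map is built per entry. */
final class VisibilitySummarizer implements SketchSummarizer {
    @Override
    public void summarize(SimpleKey key, byte[] value, SketchCounters counters) {
        counters.increment(key.visibility, 1L);
    }
}

/** Map-backed sink that a file writer could own and reuse for every entry it writes. */
final class MapCounters implements SketchCounters {
    final Map<String, Long> totals = new HashMap<>();

    @Override
    public void increment(String counter, long amount) {
        totals.merge(counter, amount, Long::sum);
    }
}
```

The same pair of interfaces could carry other summaries (for example, a count of delete entries) by swapping in a different SketchSummarizer, which is the generality the pluggable-counter proposal is after.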


>
> In the discussion, there were lots of variations on the theme, though. So,
> the actual implementation could vary. But, having something like this could
> support a large number of use cases beyond just the histogram case.
>
> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <jo...@gmail.com> wrote:
>
>> Trivially. We could do something more intelligent like also cache it in
>> metadata (updating with compactions). Don't read too much into the
>> implementation at this point; it was just the first idea I had about how we
>> could do it :). I'm more concerned with the idea and its security
>> implications right now.
>>
>> In general, it seems like people are ok with it protected by a new
>> permission role. Do you have more to add, Mike? Was your comment based on
>> your interpretation of how Accumulo works or more a concern about
>> implementing such a feature?
>>
>> On Oct 11, 2016 21:29, <dl...@comcast.net> wrote:
>>
>> > So, to get the set of visibilities used in a table, we would have to open
>> > all of the rfiles?
>> >
>> > > -----Original Message-----
>> > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > To: Accumulo Dev List
>> > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>> > >
>> > > Interesting idea.  It begs the question: should we allow any custom index
>> > > at the RFile level?  If RFile indexes were user-extensible, then a
>> > > visibility index would be something any developer could write.  That said,
>> > > we can still include such an index as an example, and if we did it could be
>> > > used by the Accumulo monitor.
>> > >
>> > > The RFile-level sampling followed this path.  I would support further work
>> > > similar to it, though I admit I don't know how difficult a job it entails.
>> > > Bonus points if the index information could be accessed from iterators the
>> > > same way that sampled data can.
>> > >
>> > > I can't speak to the appropriateness of visibility histograms on the
>> > > monitor *by default*, but it would be a strictly useful feature if it could
>> > > be enabled via a conf option.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
A nice round number to track this work: 
https://issues.apache.org/jira/browse/ACCUMULO-4500

Josh Elser wrote:
> Thanks for the reply, Mike.
>
> Mike Drob wrote:
>> Hiding this behind the SystemPermission.SYSTEM permission might be
>> sufficient.
>
> Superb. Personally, I wouldn't want to piggy-back on SystemPermission.SYSTEM
> (because that permission implies a lot of other things too), but that's
> an implementation detail we can hash out later.
>
>> In a situation where Accumulo data is on an encrypted volume, or the rfiles
>> themselves are encrypted, then a root user wouldn't be able to read the
>> rfiles to generate the histograms. This matches my initial mental model of
>> an admin user that doesn't necessarily need access to data and data
>> users that don't have access to admin commands. There is no all powerful
>> root user that can do everything and read everything.
>
> I agree with you that we should not assume an admin has the ability to
> read all data in all cases. In some cases it might, but encrypted
> files are one good example where that cannot happen. I do draw
> a distinction between being able to read all data and generating a count
> of the unique visibility labels. I think that, in most cases, such a
> sketch on the visibilities in the system does not leak any sensitive
> data; however, hiding that access behind a system permission is a good
> compromise for those whose use-cases I haven't considered :)
>
>> Have we ever discussed an "emergency access, give me all the permissions"
>> model? I feel like I've heard John Vines mention this before, I think. This
>> would be a reasonable extension of that.
>
> I don't recall hearing of that one before, and I don't think I agree
> that this proposal is an extension of it. The number of records in the
> system and the visibility of them are purely "metadata" which do not
> expose identifying information about the actual data.
>
>> Mike
>>
>> On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<jo...@gmail.com> wrote:
>>
>>> Ping Marc/Mike D.
>>>
>>>
>>> Josh Elser wrote:
>>>
>>>> Thanks, Marc. Follow-on question(s) for you:
>>>>
> >>>> Do you think _any_ such approach should never be pursued by Accumulo
> >>>> (reading into your other replies about doing it outside of Accumulo)?
> >>>> Are the permissions that we have in place not sufficient to protect such
> >>>> "metadata"?
> >>>>
> >>>> Or, would such a feature be "OK" to you if it required some degree of
> >>>> additional manual steps by the administrator? (if so, what steps do you
> >>>> think make this acceptable)
> >>>>
> >>>> In a similar vein, how do you see this broadening the scope of the
> >>>> Accumulo security model in an invalid manner? e.g. Administrators should
> >>>> never be able to see such information. Someone with sufficient access to
> >>>> a system would already be able to bypass Accumulo's security mechanisms.
> >>>> There are a number of vectors already where a sufficiently-credentialed
> >>>> individual could figure out this information (and more).
> >>>>
> >>>> Ultimately, I see Accumulo's main security tenet as "users should never
> >>>> be allowed to see more data than they are authorized to see". Maybe it's
> >>>> my interpretation of that or the scope of how you think the proposed
> >>>> feature would function, but I'd be very interested in hearing more about
> >>>> what you think.
>>>>
>>>> Marc P. wrote:
>>>>
> >>>>> My point for discussing implementation outside of accumulo is because I
> >>>>> think it does invalidate a core tenet
> >>>>>
> >>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com> wrote:
> >>>>>
> >>>>>> Again, can we please bring this discussion back from discussions of
> >>>>>> implementations to security?
> >>>>>>
> >>>>>> Does the fact that you three were discussing implementations imply that
> >>>>>> you do not think this invalidates one of the core tenets (security
> >>>>>> first) of Accumulo?
>>>>>>
>>>>>> Christopher wrote:
>>>>>>
>>>>>>> Keith, Russ, myself (and possible others) were discussing this at
>>>>>>> the
>>>>>>> hackathon after the Accumulo Summit, and I think our consensus were
>>>>>>> basically this:
>>>>>>>
>>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>>>
>>>>>> counters
>>>>>>
>>>>>>> into the RFiles. We can then use these counters in custom compaction
>>>>>>> strategies, or other analysis. We can aggregate these counters at
>>>>>>> the
>>>>>>> tablet, and table levels, and expose them in the API.
>>>>>>>
>>>>>>> These counters could store information about visibility frequencies,
>>>>>>>
>>>>>> number
>>>>>>
>>>>>>> of delete entries, etc.
>>>>>>>
>>>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
>>>>>>>
>>>>>> Long>>.
>>>>>>
>>>>>>> In the discussion, there were lots of variations on the theme,
>>>>>>> though.
>>>>>>>
>>>>>> So,
>>>>>>
>>>>>>> the actual implementation could vary. But, having something like
>>>>>>> this
>>>>>>>
>>>>>> could
>>>>>>
>>>>>>> support a large number of use cases beyond just the histogram case.
>>>>>>>
>>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Trivially. We could do something more intelligent like also cache
>>>>>>>> it in
>>>>>>>> metadata (updating with compactions). Don't read too much into the
>>>>>>>> implementation at this point; it was just the first idea I had
>>>>>>>> about
>>>>>>>>
>>>>>>> how we
>>>>>>> could do it :). I'm more concerned with the idea and its security
>>>>>>>> implications right now.
>>>>>>>>
>>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>>> permission role. Do you have more to add, Mike? Was your comment
>>>>>>>> based
>>>>>>>>
>>>>>>> on
>>>>>>> your interpretation of how Accumulo works or more a concern about
>>>>>>>> implementing such a feature?
>>>>>>>>
>>>>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net> wrote:
>>>>>>>>
>>>>>>>> So, to get the set of visibilities used in a table, we would
>>>>>>>> have to
>>>>>>>> open
>>>>>>> all of the rfiles?
>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>>> To: Accumulo Dev List
>>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>>>
>>>>>>>>> harmful?
>>>>>>>>>
>>>>>>>>>> Interesting idea. It begs the question: should we allow any
>>>>>>>>>> custom
>>>>>>>>>>
>>>>>>>>> index at
>>>>>>>>>
>>>>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
>>>>>>>>>>
>>>>>>>>> visibility index
>>>>>>>>>
>>>>>>>>>> would be something any developer could write. That said, we can
>>>>>>>>>> still
>>>>>>>>>> include such an index as an example, and if we did it could be
>>>>>>>>>> used by
>>>>>>>>>>
>>>>>>>>> the
>>>>>>>>>
>>>>>>>>>> Accumulo monitor.
>>>>>>>>>>
>>>>>>>>>> The RFile-level sampling followed this path. I would support
>>>>>>>>>> further
>>>>>>>>>>
>>>>>>>>> work
>>>>>>>>>
>>>>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>>>>>>
>>>>>>>>> entails.
>>>>>>>>>
>>>>>>>>>> Bonus points if the index information could be accessed from
>>>>>>>>>> iterators
>>>>>>>>>>
>>>>>>>>> the
>>>>>>>>>
>>>>>>>>>> same way that sampled data can.
>>>>>>>>>>
>>>>>>>>>> I can't speak to the appropriateness of visibility histograms
>>>>>>>>>> on the
>>>>>>>>>>
>>>>>>>>> monitor
>>>>>>>>>
>>>>>>>>>> *by default*, but it would be a strictly useful feature if it
>>>>>>>>>> could be
>>>>>>>>>>
>>>>>>>>> enabled via
>>>>>>>>>
>>>>>>>>>> a conf option.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh
>>>>>>>>>> Elser<jo...@gmail.com>
>>>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One
>>>>>>>>>> topic
>>>>>>>>>> he
>>>>>>>>> mentioned was the lack of insight into the distribution of data
>>>>>>>>>> marked
>>>>>>>>> with certain visibilities in a table. He presented an example
>>>>>>>>>>> similar
>>>>>>>>>>>
>>>>>>>>>> to this:
>>>>>>>>>> Image a hypothetical system backed by Accumulo which stores
>>>>>>>>>> medical
>>>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could
>>>>>>>>>>> reasonably
>>>>>>>>>>>
>>>>>>>>>> be
>>>>>>>>> considered to identify the individual. ANONYMIZED data is some
>>>>>>>>>> altered
>>>>>>>>> version of the attribute that retains some portion of the original
>>>>>>>>>>> value, but is missing enough context to not identify the
>>>>>>>>>>> individual
>>>>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is
>>>>>>>>>>> for
>>>>>>>>>>> attributes which are cannot identify the individual.
>>>>>>>>>>>
>>>>>>>>>>> Doctors would be able to read the PRIVATE data, while
>>>>>>>>>>> researchers
>>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>>> question: how much of each kind of data is in the system?
>>>>>>>>>>> Without
>>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>>> developer (who does not have the ability to read all of the
>>>>>>>>>>> PRIVATE
>>>>>>>>>>> data) know that their application is returning an reasonably
>>>>>>>>>>> correct
>>>>>>>>>>> amount of data? (there are many examples of questions which
>>>>>>>>>>> could be
>>>>>>>>>>> answer on this data alone)
>>>>>>>>>>>
>>>>>>>>>>> Concretely, this histogram would look like (50 records with
>>>>>>>>>>> PRIVATE,
>>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> PRIVATE: 50
>>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>>> PUBLIC: 20
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>>> implement. Inside of each RFile, we could maintain some
>>>>>>>>>>> histogram of
>>>>>>>>>>> the visibilities observed in that file. This would allow us
>>>>>>>>>>> to very
>>>>>>>>>>> easily report how much data in each table has each visibility
>>>>>>>>>>> label.
>>>>>>>>>>>
>>>>>>>>>>> However, would this feature be harmful to one of the core
>>>>>>>>>>> tenants of
>>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo
>>>>>>>>>>> with
>>>>>>>>>>> a certain visibility acceptable? Would a new permission to
>>>>>>>>>>> use such
>>>>>>>>>>>
>>>>>>>>>> an
>>>>>>>>> API to access this information be sufficient to protect the data?
>>>>>>>>>>> - Josh
>>>>>>>>>>>
>>>>>>>>>>>
>>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Thanks for the reply, Mike.

Mike Drob wrote:
> Hiding this behind the SystemPermission.SYSTEM permission might be
> sufficient.

Superb. Personally, I wouldn't want to piggy-back on SystemPermission.SYSTEM 
(because that permission implies a lot of other things too), but that's 
an implementation detail we can hash out later.

> In a situation where Accumulo data is on an encrypted volume, or the rfiles
> themselves are encrypted, then a root user wouldn't be able to read the
> rfiles to generate the histograms. This matches my initial mental model of
> an admin user that doesn't necessarily need access to data and data
> users that don't have access to admin commands. There is no all powerful
> root user that can do everything and read everything.

I agree with you that we should not assume an admin has the ability to 
read all data in all cases. In some cases it might, but encrypted 
files are one good example where that cannot happen. I do draw 
a distinction between being able to read all data and generating a count 
of the unique visibility labels. I think that, in most cases, such a 
sketch on the visibilities in the system does not leak any sensitive 
data; however, hiding that access behind a system permission is a good 
compromise for those whose use-cases I haven't considered :)
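
To make that distinction concrete, here is a minimal sketch of what
"generating a count of the unique visibility labels" could look like. It
is not Accumulo code: the class and method names are hypothetical, plain
strings stand in for column visibilities, and each whole visibility
expression is treated as one label. The point is only that the counting
step sees labels and never the keys or values themselves, and that
per-file counts merge trivially into a per-table count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of any Accumulo API.
public class VisibilityHistogram {

    // Count how many entries in one (hypothetical) file carry each label.
    public static Map<String, Long> countLabels(List<String> visibilities) {
        Map<String, Long> counts = new TreeMap<>();
        for (String vis : visibilities) {
            counts.merge(vis, 1L, Long::sum);
        }
        return counts;
    }

    // Combine per-file histograms into a single table-level histogram.
    public static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> histogram : perFile) {
            histogram.forEach((label, n) -> total.merge(label, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> fileA = countLabels(Arrays.asList("PRIVATE", "PUBLIC", "PRIVATE"));
        Map<String, Long> fileB = countLabels(Arrays.asList("ANONYMIZED", "PUBLIC"));
        // Prints the merged counts, sorted by label.
        System.out.println(mergeHistograms(Arrays.asList(fileA, fileB)));
    }
}
```

Whether even these counts are sensitive is exactly the open question in
this thread; the sketch does not settle that.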

> Have we ever discussed an "emergency access, give me all the permissions"
> model? I feel like I've heard John Vines mention this before, I think. This
> would be a reasonable extension of that.

I don't recall hearing of that one before, and I don't think I agree 
that this proposal is an extension of it. The number of records in the 
system and the visibility of them are purely "metadata" which do not 
expose identifying information about the actual data.

> Mike
>
> On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<jo...@gmail.com>  wrote:
>
>> Ping Marc/Mike D.
>>
>>
>> Josh Elser wrote:
>>
>>> Thanks, Marc. Follow-on question(s) for you:
>>>
>>> Do you think _any_ such approach should never be pursued by Accumulo
>>> (reading into your other replies about doing it outside of Accumulo)?
>>> Are the permissions that we have in place not sufficient to protect such
>>> "metadata"?
>>>
>>> Or, would such a feature be "OK" to you if it required some degree of
>>> additional manual steps by the administrator? (if so, what steps do you
>>> think make this acceptable)
>>>
>>> In a similar vein, how do you see this broadening the scope of the
>>> Accumulo security model in an invalid manner? e.g. Administrators should
>>> never be able to see such information. Someone with sufficient access to
>>> a system would already be able to bypass Accumulo's security mechanisms.
>>> There are a number of vectors already where a sufficiently-credentialed
>>> individual could figure out this information (and more).
>>>
>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>> be allowed to see more data than they are authorized to see". Maybe it's
>>> my interpretation of that or the scope of how you think the proposed
>>> feature would function, but I'd be very interested in hearing more about
>>> what you think.
>>>
>>> Marc P. wrote:
>>>
>>>> My point for discussing implementation outside of accumulo is because I
>>>> think it does invalidate a core tenet
>>>>
>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com>  wrote:
>>>>
>>>> Again, can we please bring this discussion back from discussions of
>>>>> implementations to security?
>>>>>
>>>>> Does the fact that you three were discussing implementations imply that
>>>>> you do not think this invalidates one of the core tenets (security
>>>>> first) of Accumulo?
>>>>>
>>>>> Christopher wrote:
>>>>>
>>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>>> basically this:
>>>>>>
>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>>
>>>>> counters
>>>>>
>>>>>> into the RFiles. We can then use these counters in custom compaction
>>>>>> strategies, or other analysis. We can aggregate these counters at the
>>>>>> tablet, and table levels, and expose them in the API.
>>>>>>
>>>>>> These counters could store information about visibility frequencies,
>>>>>>
>>>>> number
>>>>>
>>>>>> of delete entries, etc.
>>>>>>
>>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
>>>>>>
>>>>> Long>>.
>>>>>
>>>>>> In the discussion, there were lots of variations on the theme, though.
>>>>>>
>>>>> So,
>>>>>
>>>>>> the actual implementation could vary. But, having something like this
>>>>>>
>>>>> could
>>>>>
>>>>>> support a large number of use cases beyond just the histogram case.
>>>>>>
>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>>
>>>>>> Trivially. We could do something more intelligent like also cache
>>>>>>> it in
>>>>>>> metadata (updating with compactions). Don't read too much into the
>>>>>>> implementation at this point; it was just the first idea I had about
>>>>>>>
>>>>>> how we
>>>>>> could do it :). I'm more concerned with the idea and its security
>>>>>>> implications right now.
>>>>>>>
>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>>>>>>
>>>>>> on
>>>>>> your interpretation of how Accumulo works or more a concern about
>>>>>>> implementing such a feature?
>>>>>>>
>>>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net>  wrote:
>>>>>>>
>>>>>>> So, to get the set of visibilities used in a table, we would have to
>>>>>>> open
>>>>>> all of the rfiles?
>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>> To: Accumulo Dev List
>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>>
>>>>>>>> harmful?
>>>>>>>>
>>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>>
>>>>>>>> index at
>>>>>>>>
>>>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
>>>>>>>>>
>>>>>>>> visibility index
>>>>>>>>
>>>>>>>>> would be something any developer could write. That said, we can
>>>>>>>>> still
>>>>>>>>> include such an index as an example, and if we did it could be
>>>>>>>>> used by
>>>>>>>>>
>>>>>>>> the
>>>>>>>>
>>>>>>>>> Accumulo monitor.
>>>>>>>>>
>>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>>
>>>>>>>> work
>>>>>>>>
>>>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>>>>>
>>>>>>>> entails.
>>>>>>>>
>>>>>>>>> Bonus points if the index information could be accessed from
>>>>>>>>> iterators
>>>>>>>>>
>>>>>>>> the
>>>>>>>>
>>>>>>>>> same way that sampled data can.
>>>>>>>>>
>>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>>
>>>>>>>> monitor
>>>>>>>>
>>>>>>>>> *by default*, but it would be a strictly useful feature if it
>>>>>>>>> could be
>>>>>>>>>
>>>>>>>> enabled via
>>>>>>>>
>>>>>>>>> a conf option.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>>>> he mentioned was the lack of insight into the distribution of data
>>>>>>>>>> marked with certain visibilities in a table. He presented an example
>>>>>>>>>> similar to this:
>>>>>>>>>>
>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>>>> be considered to identify the individual. ANONYMIZED data is some
>>>>>>>>>> altered version of the attribute that retains some portion of the
>>>>>>>>>> original value, but is missing enough context to not identify the
>>>>>>>>>> individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>>>>>>>>>> data is for attributes which cannot identify the individual.
>>>>>>>>>>
>>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>>>> amount of data? (there are many examples of questions which could be
>>>>>>>>>> answered on this data alone)
>>>>>>>>>>
>>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> PRIVATE: 50
>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>> PUBLIC: 20
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>>>
>>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>>>>>>>> an API to access this information be sufficient to protect the data?
>>>>>>>>>>
>>>>>>>>>> - Josh
>>>>>>>>>>
>>>>>>>>>>
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Mike Drob <md...@mdrob.com>.
Hiding this behind the SystemPermission.SYSTEM permission might be
sufficient.

In a situation where Accumulo data is on an encrypted volume, or the rfiles
themselves are encrypted, then a root user wouldn't be able to read the
rfiles to generate the histograms. This matches my initial mental model of
an admin user that doesn't necessarily need access to data and data
users that don't have access to admin commands. There is no all-powerful
root user that can do everything and read everything.

Have we ever discussed an "emergency access, give me all the permissions"
model? I feel like I've heard John Vines mention this before, I think. This
would be a reasonable extension of that.

Mike

On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser <jo...@gmail.com> wrote:

> Ping Marc/Mike D.
>
>
> Josh Elser wrote:
>
>> Thanks, Marc. Follow-on question(s) for you:
>>
>> Do you think _any_ such approach should never be pursued by Accumulo
>> (reading into your other replies about doing it outside of Accumulo)?
>> Are the permissions that we have in place not sufficient to protect such
>> "metadata"?
>>
>> Or, would such a feature be "OK" to you if it required some degree of
>> additional manual steps by the administrator? (if so, what steps do you
>> think make this acceptable)
>>
>> In a similar vein, how do you see this broadening the scope of the
>> Accumulo security model in an invalid manner? e.g. Administrators should
>> never be able to see such information. Someone with sufficient access to
>> a system would already be able to bypass Accumulo's security mechanisms.
>> There are a number of vectors already where a sufficiently-credentialed
>> individual could figure out this information (and more).
>>
>> Ultimately, I see Accumulo's main security tenet as "users should never
>> be allowed to see more data than they are authorized to see". Maybe it's
>> my interpretation of that or the scope of how you think the proposed
>> feature would function, but I'd be very interested in hearing more about
>> what you think.
>>
>> Marc P. wrote:
>>
>>> My point for discussing implementation outside of accumulo is because I
>>> think it does invalidate a core tenet
>>>
>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com> wrote:
>>>
>>> Again, can we please bring this discussion back from discussions of
>>>> implementations to security?
>>>>
>>>> Does the fact that you three were discussing implementations imply that
>>>> you do not think this invalidates one of the core tenets (security
>>>> first) of Accumulo?
>>>>
>>>> Christopher wrote:
>>>>
>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>> basically this:
>>>>>
>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>
>>>> counters
>>>>
>>>>> into the RFiles. We can then use these counters in custom compaction
>>>>> strategies, or other analysis. We can aggregate these counters at the
>>>>> tablet, and table levels, and expose them in the API.
>>>>>
>>>>> These counters could store information about visibility frequencies,
>>>>>
>>>> number
>>>>
>>>>> of delete entries, etc.
>>>>>
>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
>>>>>
>>>> Long>>.
>>>>
>>>>> In the discussion, there were lots of variations on the theme, though.
>>>>>
>>>> So,
>>>>
>>>>> the actual implementation could vary. But, having something like this
>>>>>
>>>> could
>>>>
>>>>> support a large number of use cases beyond just the histogram case.
>>>>>
>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>>>>>
>>>> wrote:
>>>>
>>>>> Trivially. We could do something more intelligent like also cache
>>>>>> it in
>>>>>> metadata (updating with compactions). Don't read too much into the
>>>>>> implementation at this point; it was just the first idea I had about
>>>>>>
>>>>> how we
>>>>
>>>>> could do it :). I'm more concerned with the idea and its security
>>>>>> implications right now.
>>>>>>
>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>>>>>
>>>>> on
>>>>
>>>>> your interpretation of how Accumulo works or more a concern about
>>>>>> implementing such a feature?
>>>>>>
>>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net> wrote:
>>>>>>
>>>>>> So, to get the set of visibilities used in a table, we would have to
>>>>>>>
>>>>>> open
>>>>
>>>>> all of the rfiles?
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>> To: Accumulo Dev List
>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>
>>>>>>> harmful?
>>>>>>>
>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>
>>>>>>> index at
>>>>>>>
>>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
>>>>>>>>
>>>>>>> visibility index
>>>>>>>
>>>>>>>> would be something any developer could write. That said, we can
>>>>>>>> still
>>>>>>>> include such an index as an example, and if we did it could be
>>>>>>>> used by
>>>>>>>>
>>>>>>> the
>>>>>>>
>>>>>>>> Accumulo monitor.
>>>>>>>>
>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>
>>>>>>> work
>>>>>>>
>>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>>>>
>>>>>>> entails.
>>>>>>>
>>>>>>>> Bonus points if the index information could be accessed from
>>>>>>>> iterators
>>>>>>>>
>>>>>>> the
>>>>>>>
>>>>>>>> same way that sampled data can.
>>>>>>>>
>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>
>>>>>>> monitor
>>>>>>>
>>>>>>>> *by default*, but it would be a strictly useful feature if it
>>>>>>>> could be
>>>>>>>>
>>>>>>> enabled via
>>>>>>>
>>>>>>>> a conf option.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>>> he mentioned was the lack of insight into the distribution of data
>>>>>>>>> marked with certain visibilities in a table. He presented an example
>>>>>>>>> similar to this:
>>>>>>>>>
>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>>> be considered to identify the individual. ANONYMIZED data is some
>>>>>>>>> altered version of the attribute that retains some portion of the
>>>>>>>>> original value, but is missing enough context to not identify the
>>>>>>>>> individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>>>>>>>>> data is for attributes which cannot identify the individual.
>>>>>>>>>
>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>>> amount of data? (there are many examples of questions which could be
>>>>>>>>> answered on this data alone)
>>>>>>>>>
>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> PRIVATE: 50
>>>>>>>>> ANONYMIZED: 50
>>>>>>>>> PUBLIC: 20
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>>
>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>>>>>>> an API to access this information be sufficient to protect the data?
>>>>>>>>>
>>>>>>>>> - Josh
>>>>>>>>>
>>>>>>>>>
>>>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Ping Marc/Mike D.

Josh Elser wrote:
> Thanks, Marc. Follow-on question(s) for you:
>
> Do you think _any_ such approach should never be pursued by Accumulo
> (reading into your other replies about doing it outside of Accumulo)?
> Are the permissions that we have in place not sufficient to protect such
> "metadata"?
>
> Or, would such a feature be "OK" to you if it required some degree of
> additional manual steps by the administrator? (if so, what steps do you
> think make this acceptable)
>
> In a similar vein, how do you see this broadening the scope of the
> Accumulo security model in an invalid manner? e.g. Administrators should
> never be able to see such information. Someone with sufficient access to
> a system would already be able to bypass Accumulo's security mechanisms.
> There are a number of vectors already where a sufficiently-credentialed
> individual could figure out this information (and more).
>
> Ultimately, I see Accumulo's main security tenet as "users should never
> be allowed to see more data than they are authorized to see". Maybe it's
> my interpretation of that or the scope of how you think the proposed
> feature would function, but I'd be very interested in hearing more about
> what you think.
>
> Marc P. wrote:
>> My point for discussing implementation outside of accumulo is because I
>> think it does invalidate a core tenet
>>
>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com> wrote:
>>
>>> Again, can we please bring this discussion back from discussions of
>>> implementations to security?
>>>
>>> Does the fact that you three were discussing implementations imply that
>>> you do not think this invalidates one of the core tenets (security
>>> first) of Accumulo?
>>>
>>> Christopher wrote:
>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>> basically this:
>>>>
>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>> counters
>>>> into the RFiles. We can then use these counters in custom compaction
>>>> strategies, or other analysis. We can aggregate these counters at the
>>>> tablet, and table levels, and expose them in the API.
>>>>
>>>> These counters could store information about visibility frequencies,
>>> number
>>>> of delete entries, etc.
>>>>
>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
>>> Long>>.
>>>> In the discussion, there were lots of variations on the theme, though.
>>> So,
>>>> the actual implementation could vary. But, having something like this
>>> could
>>>> support a large number of use cases beyond just the histogram case.
>>>>
>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>>> wrote:
>>>>> Trivially. We could do something more intelligent like also cache
>>>>> it in
>>>>> metadata (updating with compactions). Don't read too much into the
>>>>> implementation at this point; it was just the first idea I had about
>>> how we
>>>>> could do it :). I'm more concerned with the idea and its security
>>>>> implications right now.
>>>>>
>>>>> In general, it seems like people are ok with it protected by a new
>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>> on
>>>>> your interpretation of how Accumulo works or more a concern about
>>>>> implementing such a feature?
>>>>>
>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net> wrote:
>>>>>
>>>>>> So, to get the set of visibilities used in a table, we would have to
>>> open
>>>>>> all of the rfiles?
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>> To: Accumulo Dev List
>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>> harmful?
>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>> index at
>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
>>>>>> visibility index
>>>>>>> would be something any developer could write. That said, we can
>>>>>>> still
>>>>>>> include such an index as an example, and if we did it could be
>>>>>>> used by
>>>>>> the
>>>>>>> Accumulo monitor.
>>>>>>>
>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>> work
>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>> entails.
>>>>>>> Bonus points if the index information could be accessed from
>>>>>>> iterators
>>>>>> the
>>>>>>> same way that sampled data can.
>>>>>>>
>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>> monitor
>>>>>>> *by default*, but it would be a strictly useful feature if it
>>>>>>> could be
>>>>>> enabled via
>>>>>>> a conf option.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>>>>> wrote:
>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>> he mentioned was the lack of insight into the distribution of data
>>>>>>>> marked with certain visibilities in a table. He presented an example
>>>>>>>> similar to this:
>>>>>>>>
>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>> be considered to identify the individual. ANONYMIZED data is some
>>>>>>>> altered version of the attribute that retains some portion of the
>>>>>>>> original value, but is missing enough context to not identify the
>>>>>>>> individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>>>>>>>> data is for attributes which cannot identify the individual.
>>>>>>>>
>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>> amount of data? (there are many examples of questions which could be
>>>>>>>> answered on this data alone)
>>>>>>>>
>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> PRIVATE: 50
>>>>>>>> ANONYMIZED: 50
>>>>>>>> PUBLIC: 20
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>
>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>>>>>> an API to access this information be sufficient to protect the data?
>>>>>>>>
>>>>>>>> - Josh
>>>>>>>>
>>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Christopher <ct...@apache.org>.
I think the SystemPermission.SYSTEM permission should probably be required for
any public API retrieving this data. It is, after all, code run on servers,
generating data directly from the RFiles. This would also imply that
caution is needed if we were to cache the data in, say, the metadata table.
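
As a rough illustration of gating such an API behind a permission, here
is a minimal, self-contained sketch. Everything in it is hypothetical:
the Permission enum is a stand-in and not Accumulo's SystemPermission,
and GuardedHistogramApi is not a proposed interface. It only shows the
shape of the check: the caller's permissions are tested before any
counts are returned, and the returned map is read-only.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not part of any Accumulo API.
public class GuardedHistogramApi {

    // Stand-in for a real permission type (assumption for this sketch).
    public enum Permission { SYSTEM, READ }

    private final Map<String, Long> histogram;

    public GuardedHistogramApi(Map<String, Long> histogram) {
        // Defensive copy so later changes by the caller are not visible here.
        this.histogram = new TreeMap<>(histogram);
    }

    // Return the label counts only to callers holding the SYSTEM permission.
    public Map<String, Long> getHistogram(Set<Permission> callerPermissions) {
        if (!callerPermissions.contains(Permission.SYSTEM)) {
            throw new SecurityException("caller lacks the SYSTEM permission");
        }
        return Collections.unmodifiableMap(histogram);
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new TreeMap<>();
        counts.put("PRIVATE", 50L);
        counts.put("ANONYMIZED", 50L);
        counts.put("PUBLIC", 20L);
        GuardedHistogramApi api = new GuardedHistogramApi(counts);

        // A caller with SYSTEM gets the counts.
        System.out.println(api.getHistogram(EnumSet.of(Permission.SYSTEM)));

        // A caller with only READ is rejected.
        try {
            api.getHistogram(EnumSet.of(Permission.READ));
        } catch (SecurityException e) {
            System.out.println("denied: " + e.getMessage());
        }
    }
}
```

Any cached copy of the counts (for example in the metadata table) would
need an equivalent check on every read path, which is the caution
mentioned above.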

On Wed, Oct 12, 2016 at 3:58 PM Josh Elser <jo...@gmail.com> wrote:

> I was envisioning a public API protected by a system permission (implying
> some Thrift RPC as well) if that is an important distinction for those
> with concerns. I am hoping to get more info from Mike/Marc about why
> they feel this is insufficient WRT Accumulo's security model.
>
> Keith Turner wrote:
> > We did discuss making this info available through the public API (and
> > adding thrift calls to gather it).   We discussed the possibility of
> > adding a new permission.
> >
> > On Wed, Oct 12, 2016 at 2:35 PM, ivan bella<iv...@ivan.bella.name>
> wrote:
> >> I do not see how this invalidates any security of the system unless you
> >> are summarizing these counters and making them available through a thrift
> >> or other call; don't do that unless other security is put in place. To get
> >> a summary I would think you would have to use a separate utility to scrape
> >> the rfiles. This metadata should only be accessible to a system
> >> administrator. The BIG presumption here is that it is significantly faster
> >> to grab this metadata out than it is to scan all of the keys in the
> >> rfile.
> >>
> >>
> >>> On October 12, 2016 at 1:41 PM Josh Elser<jo...@gmail.com>
> wrote:
> >>>
> >>> Thanks, Marc. Follow-on question(s) for you:
> >>>
> >>> Do you think _any_ such approach should never be pursued by Accumulo
> >>> (reading into your other replies about doing it outside of Accumulo)?
> >>> Are the permissions that we have in place not sufficient to protect
> such
> >>> "metadata"?
> >>>
> >>> Or, would such a feature be "OK" to you if it required some degree of
> >>> additional manual steps by the administrator? (if so, what steps do you
> >>> think make this acceptable)
> >>>
> >>> In a similar vein, how do you see this broadening the scope of the
> >>> Accumulo security model in an invalid manner? e.g. Administrators
> should
> >>> never be able to see such information. Someone with sufficient access
> to
> >>> a system would already be able to bypass Accumulo's security
> mechanisms.
> >>> There are a number of vectors already where a sufficiently-credentialed
> >>> individual could figure out this information (and more).
> >>>
> >>> Ultimately, I see Accumulo's main security tenet as "users should never
> >>> be allowed to see more data than they are authorized to see". Maybe it's
> >>> my interpretation of that or the scope of how you think the proposed
> >>> feature would function, but I'd be very interested in hearing more about
> >>> what you think.
> >>>
> >>> Marc P. wrote:
> >>>
> >>>> My point for discussing implementation outside of accumulo is because I
> >>>> think it does invalidate a core tenet
> >>>>
> >>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com>
> wrote:
> >>>>
> >>>>> Again, can we please bring this discussion back from discussions of
> >>>>> implementations to security?
> >>>>>
> >>>>> Does the fact that you three were discussing implementations imply
> that
> >>>>> you do not think this invalidates one of the core tenets (security
> >>>>> first) of Accumulo?
> >>>>>
> >>>>> Christopher wrote:
> >>>>>
> >>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
> >>>>>> hackathon after the Accumulo Summit, and I think our consensus was
> >>>>>> basically this:
> >>>>>>
> >>>>>> We need a generic pluggable mechanism for injecting arbitrary user
> >>>>>> counters
> >>>>>> into the RFiles. We can then use these counters in custom compaction
> >>>>>> strategies, or other analysis. We can aggregate these counters at
> the
> >>>>>> tablet, and table levels, and expose them in the API.
> >>>>>>
> >>>>>> These counters could store information about visibility frequencies,
> >>>>>> number
> >>>>>> of delete entries, etc.
> >>>>>>
> >>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
> Long>>.
> >>>>>> In the discussion, there were lots of variations on the theme,
> though.
> >>>>>> So,
> >>>>>> the actual implementation could vary. But, having something like
> this
> >>>>>> could
> >>>>>> support a large number of use cases beyond just the histogram case.
> >>>>>>
> >>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Trivially. We could do something more intelligent like also cache
> it in
> >>>>>>> metadata (updating with compactions). Don't read too much into the
> >>>>>>> implementation at this point; it was just the first idea I had
> about
> >>>>>>> how we
> >>>>>>> could do it :). I'm more concerned with the idea and its security
> >>>>>>> implications right now.
> >>>>>>>
> >>>>>>> In general, it seems like people are ok with it protected by a new
> >>>>>>> permission role. Do you have more to add, Mike? Was your comment
> based
> >>>>>>> on
> >>>>>>> your interpretation of how Accumulo works or more a concern about
> >>>>>>> implementing such a feature?
> >>>>>>>
> >>>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net>  wrote:
> >>>>>>>
> >>>>>>>> So, to get the set of visibilities used in a table, we would have
> to
> >>>>>>>> open
> >>>>>>>> all of the rfiles?
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> >>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
> >>>>>>>>> To: Accumulo Dev List
> >>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> >>>>>>>>> harmful?
> >>>>>>>>> Interesting idea. It begs the question: should we allow any
> custom
> >>>>>>>>> index at
> >>>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
> >>>>>>>>> visibility index
> >>>>>>>>> would be something any developer could write. That said, we can
> still
> >>>>>>>>> include such an index as an example, and if we did it could be
> used by
> >>>>>>>>> the
> >>>>>>>>> Accumulo monitor.
> >>>>>>>>>
> >>>>>>>>> The RFile-level sampling followed this path. I would support
> further
> >>>>>>>>> work
> >>>>>>>>> similar to it, though I admit I don't know how difficult a job it
> >>>>>>>>> entails.
> >>>>>>>>> Bonus points if the index information could be accessed from
> iterators
> >>>>>>>>> the
> >>>>>>>>> same way that sampled data can.
> >>>>>>>>>
> >>>>>>>>> I can't speak to the appropriateness of visibility histograms on
> the
> >>>>>>>>> monitor
> >>>>>>>>> *by default*, but it would be a strictly useful feature if it
> could be
> >>>>>>>>> enabled via
> >>>>>>>>> a conf option.
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<
> josh.elser@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One
> topic
> >>>>>>>>>> he
> >>>>>>>>>> mentioned was the lack of insight into the distribution of data
> >>>>>>>>>> marked
> >>>>>>>>>> with certain visibilities in a table. He presented an example
> similar
> >>>>>>>>>> to this:
> >>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores
> medical
> >>>>>>>>>> information. There are three labels in the system: PRIVATE,
> >>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could
> reasonably
> >>>>>>>>>> be
> >>>>>>>>>> considered to identify the individual. ANONYMIZED data is some
> >>>>>>>>>> altered
> >>>>>>>>>> version of the attribute that retains some portion of the
> original
> >>>>>>>>>> value, but is missing enough context to not identify the
> individual
> >>>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data
> is for
> >>>>>>>>>> attributes which cannot identify the individual.
> >>>>>>>>>>
> >>>>>>>>>> Doctors would be able to read the PRIVATE data, while
> researchers
> >>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
> >>>>>>>>>> question: how much of each kind of data is in the system?
> Without
> >>>>>>>>>> knowing how much data is in the system, how can some application
> >>>>>>>>>> developer (who does not have the ability to read all of the
> PRIVATE
> >>>>>>>>>> data) know that their application is returning a reasonably
> correct
> >>>>>>>>>> amount of data? (there are many examples of questions which
> could be
> >>>>>>>>>> answered on this data alone)
> >>>>>>>>>>
> >>>>>>>>>> Concretely, this histogram would look like (50 records with
> PRIVATE,
> >>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >>>>>>>>>>
> >>>>>>>>>> PRIVATE: 50
> >>>>>>>>>> ANONYMIZED: 50
> >>>>>>>>>> PUBLIC: 20
> >>>>>>>>>>
> >>>>>>>>>> Technically, I think this would actually be relatively simple to
> >>>>>>>>>> implement. Inside of each RFile, we could maintain some
> histogram of
> >>>>>>>>>> the visibilities observed in that file. This would allow us to
> very
> >>>>>>>>>> easily report how much data in each table has each visibility
> label.
> >>>>>>>>>>
> >>>>>>>>>> However, would this feature be harmful to one of the core
> tenets of
> >>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in
> Accumulo with
> >>>>>>>>>> a certain visibility acceptable? Would a new permission to use
> such
> >>>>>>>>>> an
> >>>>>>>>>> API to access this information be sufficient to protect the
> data?
> >>>>>>>>>>
> >>>>>>>>>> *   Josh
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
I was envisioning a public API protected by a system permission (implying 
some Thrift RPC as well) if that is an important distinction for those 
with concerns. I am hoping to get more info from Mike/Marc about why 
they feel this is insufficient WRT Accumulo's security model.
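
To make the idea concrete, here is a minimal sketch of a histogram lookup gated by a permission check. Everything in it is hypothetical: the READ_VISIBILITY_SUMMARY permission, the SummaryService class, and the in-memory maps are illustrative stand-ins, not existing Accumulo APIs or Thrift calls.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SummaryService {
    // Hypothetical permissions; READ_VISIBILITY_SUMMARY is an assumption for this sketch.
    enum Permission { READ, WRITE, READ_VISIBILITY_SUMMARY }

    private final Map<String, Set<Permission>> grants = new HashMap<>();
    private final Map<String, Map<String, Long>> histograms = new HashMap<>();

    // Grant a permission to a user.
    void grant(String user, Permission p) {
        grants.computeIfAbsent(user, u -> EnumSet.noneOf(Permission.class)).add(p);
    }

    // Record the visibility histogram for a table (label -> entry count).
    void store(String table, Map<String, Long> histogram) {
        histograms.put(table, new TreeMap<>(histogram));
    }

    // Return the histogram only if the caller holds the summary permission.
    Map<String, Long> visibilityHistogram(String user, String table) {
        Set<Permission> held = grants.getOrDefault(user, Collections.emptySet());
        if (!held.contains(Permission.READ_VISIBILITY_SUMMARY)) {
            throw new SecurityException(user + " lacks READ_VISIBILITY_SUMMARY");
        }
        return Collections.unmodifiableMap(
            histograms.getOrDefault(table, Collections.emptyMap()));
    }

    public static void main(String[] args) {
        SummaryService svc = new SummaryService();
        Map<String, Long> h = new HashMap<>();
        h.put("PRIVATE", 50L);
        h.put("ANONYMIZED", 50L);
        h.put("PUBLIC", 20L);
        svc.store("medical", h);

        svc.grant("admin", Permission.READ_VISIBILITY_SUMMARY);
        svc.grant("researcher", Permission.READ);

        System.out.println(svc.visibilityHistogram("admin", "medical"));
        try {
            svc.visibilityHistogram("researcher", "medical");
        } catch (SecurityException e) {
            System.out.println("denied: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that ordinary READ access does not imply access to the summary; whether that separation is enough is the open question in this thread.
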

Keith Turner wrote:
> We did discuss making this info available through the public API (and
> adding thrift calls to gather it).   We discussed the possibility of
> adding a new permission.
>
> On Wed, Oct 12, 2016 at 2:35 PM, ivan bella<iv...@ivan.bella.name>  wrote:
>> I do not see how this invalidates any security of the system unless you are summarizing these counters and making them available through a thrift or other call; don't do that unless other security is put in place.  To get a summary I would think you would have to use a separate utility to scrape the rfiles.  This metadata should only be accessible to a system administrator.  The BIG presumption here is that it is significantly faster to grab this metadata out than it is to scan all of the keys in the rfile.
>>
>>
>>> On October 12, 2016 at 1:41 PM Josh Elser<jo...@gmail.com>  wrote:
>>>
>>> Thanks, Marc. Follow-on question(s) for you:
>>>
>>> Do you think _any_ such approach should never be pursued by Accumulo
>>> (reading into your other replies about doing it outside of Accumulo)?
>>> Are the permissions that we have in place not sufficient to protect such
>>> "metadata"?
>>>
>>> Or, would such a feature be "OK" to you if it required some degree of
>>> additional manual steps by the administrator? (if so, what steps do you
>>> think make this acceptable)
>>>
>>> In a similar vein, how do you see this broadening the scope of the
>>> Accumulo security model in an invalid manner? e.g. Administrators should
>>> never be able to see such information. Someone with sufficient access to
>>> a system would already be able to bypass Accumulo's security mechanisms.
>>> There are a number of vectors already where a sufficiently-credentialed
>>> individual could figure out this information (and more).
>>>
>>> Ultimately, I see Accumulo's main security tenet as "users should never
>>> be allowed to see more data than they are authorized to see". Maybe it's
>>> my interpretation of that or the scope of how you think the proposed
>>> feature would function, but I'd be very interested in hearing more about
>>> what you think.
>>>
>>> Marc P. wrote:
>>>
>>>> My point for discussing implementation outside of accumulo is because I
>>>> think it does invalidate a core tenet
>>>>
>>>> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com>  wrote:
>>>>
>>>>> Again, can we please bring this discussion back from discussions of
>>>>> implementations to security?
>>>>>
>>>>> Does the fact that you three were discussing implementations imply that
>>>>> you do not think this invalidates one of the core tenets (security
>>>>> first) of Accumulo?
>>>>>
>>>>> Christopher wrote:
>>>>>
>>>>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>>>>> hackathon after the Accumulo Summit, and I think our consensus was
>>>>>> basically this:
>>>>>>
>>>>>> We need a generic pluggable mechanism for injecting arbitrary user
>>>>>> counters
>>>>>> into the RFiles. We can then use these counters in custom compaction
>>>>>> strategies, or other analysis. We can aggregate these counters at the
>>>>>> tablet, and table levels, and expose them in the API.
>>>>>>
>>>>>> These counters could store information about visibility frequencies,
>>>>>> number
>>>>>> of delete entries, etc.
>>>>>>
>>>>>> The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
>>>>>> In the discussion, there were lots of variations on the theme, though.
>>>>>> So,
>>>>>> the actual implementation could vary. But, having something like this
>>>>>> could
>>>>>> support a large number of use cases beyond just the histogram case.
>>>>>>
>>>>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Trivially. We could do something more intelligent like also cache it in
>>>>>>> metadata (updating with compactions). Don't read too much into the
>>>>>>> implementation at this point; it was just the first idea I had about
>>>>>>> how we
>>>>>>> could do it :). I'm more concerned with the idea and its security
>>>>>>> implications right now.
>>>>>>>
>>>>>>> In general, it seems like people are ok with it protected by a new
>>>>>>> permission role. Do you have more to add, Mike? Was your comment based
>>>>>>> on
>>>>>>> your interpretation of how Accumulo works or more a concern about
>>>>>>> implementing such a feature?
>>>>>>>
>>>>>>> On Oct 11, 2016 21:29,<dl...@comcast.net>  wrote:
>>>>>>>
>>>>>>>> So, to get the set of visibilities used in a table, we would have to
>>>>>>>> open
>>>>>>>> all of the rfiles?
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>>>>> To: Accumulo Dev List
>>>>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>>>>>> harmful?
>>>>>>>>> Interesting idea. It begs the question: should we allow any custom
>>>>>>>>> index at
>>>>>>>>> the RFile level? If RFile indexes were user-extensible, then a
>>>>>>>>> visibility index
>>>>>>>>> would be something any developer could write. That said, we can still
>>>>>>>>> include such an index as an example, and if we did it could be used by
>>>>>>>>> the
>>>>>>>>> Accumulo monitor.
>>>>>>>>>
>>>>>>>>> The RFile-level sampling followed this path. I would support further
>>>>>>>>> work
>>>>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>>>>>> entails.
>>>>>>>>> Bonus points if the index information could be accessed from iterators
>>>>>>>>> the
>>>>>>>>> same way that sampled data can.
>>>>>>>>>
>>>>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>>>>>> monitor
>>>>>>>>> *by default*, but it would be a strictly useful feature if it could be
>>>>>>>>> enabled via
>>>>>>>>> a conf option.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>>>>>>>> he
>>>>>>>>>> mentioned was the lack of insight into the distribution of data
>>>>>>>>>> marked
>>>>>>>>>> with certain visibilities in a table. He presented an example similar
>>>>>>>>>> to this:
>>>>>>>>>> Imagine a hypothetical system backed by Accumulo which stores medical
>>>>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>>>>>>>> be
>>>>>>>>>> considered to identify the individual. ANONYMIZED data is some
>>>>>>>>>> altered
>>>>>>>>>> version of the attribute that retains some portion of the original
>>>>>>>>>> value, but is missing enough context to not identify the individual
>>>>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>>>>>>> attributes which cannot identify the individual.
>>>>>>>>>>
>>>>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>>>>> knowing how much data is in the system, how can some application
>>>>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>>>>> data) know that their application is returning a reasonably correct
>>>>>>>>>> amount of data? (there are many examples of questions which could be
>>>>>>>>>> answered on this data alone)
>>>>>>>>>>
>>>>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>>>>
>>>>>>>>>> PRIVATE: 50
>>>>>>>>>> ANONYMIZED: 50
>>>>>>>>>> PUBLIC: 20
>>>>>>>>>>
>>>>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>>>>
>>>>>>>>>> However, would this feature be harmful to one of the core tenets of
>>>>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>>>>>>>> an
>>>>>>>>>> API to access this information be sufficient to protect the data?
>>>>>>>>>>
>>>>>>>>>> *   Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Keith Turner <ke...@deenlo.com>.
We did discuss making this info available through the public API (and
adding thrift calls to gather it).   We discussed the possibility of
adding a new permission.

On Wed, Oct 12, 2016 at 2:35 PM, ivan bella <iv...@ivan.bella.name> wrote:
> I do not see how this invalidates any security of the system unless you are summarizing these counters and making them available through a thrift or other call; don't do that unless other security is put in place.  To get a summary I would think you would have to use a separate utility to scrape the rfiles.  This metadata should only be accessible to a system administrator.  The BIG presumption here is that it is significantly faster to grab this metadata out than it is to scan all of the keys in the rfile.
>
>
>> On October 12, 2016 at 1:41 PM Josh Elser <jo...@gmail.com> wrote:
>>
>> Thanks, Marc. Follow-on question(s) for you:
>>
>> Do you think _any_ such approach should never be pursued by Accumulo
>> (reading into your other replies about doing it outside of Accumulo)?
>> Are the permissions that we have in place not sufficient to protect such
>> "metadata"?
>>
>> Or, would such a feature be "OK" to you if it required some degree of
>> additional manual steps by the administrator? (if so, what steps do you
>> think make this acceptable)
>>
>> In a similar vein, how do you see this broadening the scope of the
>> Accumulo security model in an invalid manner? e.g. Administrators should
>> never be able to see such information. Someone with sufficient access to
>> a system would already be able to bypass Accumulo's security mechanisms.
>> There are a number of vectors already where a sufficiently-credentialed
>> individual could figure out this information (and more).
>>
>> Ultimately, I see Accumulo's main security tenet as "users should never
>> be allowed to see more data than they are authorized to see". Maybe it's
>> my interpretation of that or the scope of how you think the proposed
>> feature would function, but I'd be very interested in hearing more about
>> what you think.
>>
>> Marc P. wrote:
>>
>> > My point for discussing implementation outside of accumulo is because I
>> > think it does invalidate a core tenet
>> >
>> > On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com> wrote:
>> >
>> > > Again, can we please bring this discussion back from discussions of
>> > > implementations to security?
>> > >
>> > > Does the fact that you three were discussing implementations imply that
>> > > you do not think this invalidates one of the core tenets (security
>> > > first) of Accumulo?
>> > >
>> > > Christopher wrote:
>> > >
>> > > > Keith, Russ, myself (and possibly others) were discussing this at the
>> > > > hackathon after the Accumulo Summit, and I think our consensus was
>> > > > basically this:
>> > > >
>> > > > We need a generic pluggable mechanism for injecting arbitrary user
>> > > > counters
>> > > > into the RFiles. We can then use these counters in custom compaction
>> > > > strategies, or other analysis. We can aggregate these counters at the
>> > > > tablet, and table levels, and expose them in the API.
>> > > >
>> > > > These counters could store information about visibility frequencies,
>> > > > number
>> > > > of delete entries, etc.
>> > > >
>> > > > The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
>> > > > In the discussion, there were lots of variations on the theme, though.
>> > > > So,
>> > > > the actual implementation could vary. But, having something like this
>> > > > could
>> > > > support a large number of use cases beyond just the histogram case.
>> > > >
>> > > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Trivially. We could do something more intelligent like also cache it in
>> > > > > metadata (updating with compactions). Don't read too much into the
>> > > > > implementation at this point; it was just the first idea I had about
>> > > > > how we
>> > > > > could do it :). I'm more concerned with the idea and its security
>> > > > > implications right now.
>> > > > >
>> > > > > In general, it seems like people are ok with it protected by a new
>> > > > > permission role. Do you have more to add, Mike? Was your comment based
>> > > > > on
>> > > > > your interpretation of how Accumulo works or more a concern about
>> > > > > implementing such a feature?
>> > > > >
>> > > > > On Oct 11, 2016 21:29,<dl...@comcast.net> wrote:
>> > > > >
>> > > > > > So, to get the set of visibilities used in a table, we would have to
>> > > > > > open
>> > > > > > all of the rfiles?
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> > > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > > > > > To: Accumulo Dev List
>> > > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>> > > > > > > harmful?
>> > > > > > > Interesting idea. It begs the question: should we allow any custom
>> > > > > > > index at
>> > > > > > > the RFile level? If RFile indexes were user-extensible, then a
>> > > > > > > visibility index
>> > > > > > > would be something any developer could write. That said, we can still
>> > > > > > > include such an index as an example, and if we did it could be used by
>> > > > > > > the
>> > > > > > > Accumulo monitor.
>> > > > > > >
>> > > > > > > The RFile-level sampling followed this path. I would support further
>> > > > > > > work
>> > > > > > > similar to it, though I admit I don't know how difficult a job it
>> > > > > > > entails.
>> > > > > > > Bonus points if the index information could be accessed from iterators
>> > > > > > > the
>> > > > > > > same way that sampled data can.
>> > > > > > >
>> > > > > > > I can't speak to the appropriateness of visibility histograms on the
>> > > > > > > monitor
>> > > > > > > *by default*, but it would be a strictly useful feature if it could be
>> > > > > > > enabled via
>> > > > > > > a conf option.
>> > > > > > >
>> > > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>> > > > > > > > he
>> > > > > > > > mentioned was the lack of insight into the distribution of data
>> > > > > > > > marked
>> > > > > > > > with certain visibilities in a table. He presented an example similar
>> > > > > > > > to this:
>> > > > > > > > Imagine a hypothetical system backed by Accumulo which stores medical
>> > > > > > > > information. There are three labels in the system: PRIVATE,
>> > > > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>> > > > > > > > be
>> > > > > > > > considered to identify the individual. ANONYMIZED data is some
>> > > > > > > > altered
>> > > > > > > > version of the attribute that retains some portion of the original
>> > > > > > > > value, but is missing enough context to not identify the individual
>> > > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>> > > > > > > > attributes which cannot identify the individual.
>> > > > > > > >
>> > > > > > > > Doctors would be able to read the PRIVATE data, while researchers
>> > > > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
>> > > > > > > > question: how much of each kind of data is in the system? Without
>> > > > > > > > knowing how much data is in the system, how can some application
>> > > > > > > > developer (who does not have the ability to read all of the PRIVATE
>> > > > > > > > data) know that their application is returning a reasonably correct
>> > > > > > > > amount of data? (there are many examples of questions which could be
>> > > > > > > > answered on this data alone)
>> > > > > > > >
>> > > > > > > > Concretely, this histogram would look like (50 records with PRIVATE,
>> > > > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>> > > > > > > >
>> > > > > > > > PRIVATE: 50
>> > > > > > > > ANONYMIZED: 50
>> > > > > > > > PUBLIC: 20
>> > > > > > > >
>> > > > > > > > Technically, I think this would actually be relatively simple to
>> > > > > > > > implement. Inside of each RFile, we could maintain some histogram of
>> > > > > > > > the visibilities observed in that file. This would allow us to very
>> > > > > > > > easily report how much data in each table has each visibility label.
>> > > > > > > >
>> > > > > > > > However, would this feature be harmful to one of the core tenets of
>> > > > > > > > Accumulo? Or, is acknowledging the existence of data in Accumulo with
>> > > > > > > > a certain visibility acceptable? Would a new permission to use such
>> > > > > > > > an
>> > > > > > > > API to access this information be sufficient to protect the data?
>> > > > > > > >
>> > > > > > > > *   Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by ivan bella <iv...@ivan.bella.name>.
I do not see how this invalidates any security of the system unless you are summarizing these counters and making them available through a thrift or other call; don't do that unless other security is put in place.  To get a summary I would think you would have to use a separate utility to scrape the rfiles.  This metadata should only be accessible to a system administrator.  The BIG presumption here is that it is significantly faster to grab this metadata out than it is to scan all of the keys in the rfile.
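
As a rough illustration of why reading stored summaries could be cheaper than scanning keys, the sketch below builds a per-file visibility count once (as would happen when a file is written) and then answers a table-level query by merging only those small maps. It borrows the Function<Entry<Key,Value>,Map<String,Long>> shape quoted below, but with plain strings standing in for Key and Value; RFileCounters, summarizeFile, and mergeSummaries are hypothetical names, not Accumulo code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

public class RFileCounters {
    // Stand-in for Function<Entry<Key,Value>,Map<String,Long>>: in this sketch the
    // entry key is just the visibility label and the entry value is the cell value.
    static final Function<Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
        e -> Collections.singletonMap(e.getKey(), 1L);

    // Helper to build a fake (visibility, value) entry.
    static Entry<String, String> cell(String visibility, String value) {
        return new SimpleEntry<>(visibility, value);
    }

    // Done once per file: apply the counter to every entry and sum the results.
    static Map<String, Long> summarizeFile(List<Entry<String, String>> entries) {
        Map<String, Long> summary = new TreeMap<>();
        for (Entry<String, String> e : entries) {
            VISIBILITY_COUNTER.apply(e).forEach((k, v) -> summary.merge(k, v, Long::sum));
        }
        return summary;
    }

    // Done per query: merge the small per-file summaries without touching any keys.
    static Map<String, Long> mergeSummaries(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> s : perFile) {
            s.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> fileA = summarizeFile(Arrays.asList(
            cell("PRIVATE", "Josh Elser"), cell("ANONYMIZED", "J E")));
        Map<String, Long> fileB = summarizeFile(Arrays.asList(
            cell("PUBLIC", "some public attribute"), cell("PRIVATE", "another name")));
        System.out.println(mergeSummaries(Arrays.asList(fileA, fileB)));
    }
}
```

The query cost here depends on the number of files and distinct labels rather than the number of entries, which is the presumption stated above; it says nothing about who should be allowed to call it.
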


> On October 12, 2016 at 1:41 PM Josh Elser <jo...@gmail.com> wrote:
> 
> Thanks, Marc. Follow-on question(s) for you:
> 
> Do you think _any_ such approach should never be pursued by Accumulo
> (reading into your other replies about doing it outside of Accumulo)?
> Are the permissions that we have in place not sufficient to protect such
> "metadata"?
> 
> Or, would such a feature be "OK" to you if it required some degree of
> additional manual steps by the administrator? (if so, what steps do you
> think make this acceptable)
> 
> In a similar vein, how do you see this broadening the scope of the
> Accumulo security model in an invalid manner? e.g. Administrators should
> never be able to see such information. Someone with sufficient access to
> a system would already be able to bypass Accumulo's security mechanisms.
> There are a number of vectors already where a sufficiently-credentialed
> individual could figure out this information (and more).
> 
> Ultimately, I see Accumulo's main security tenet as "users should never
> be allowed to see more data than they are authorized to see". Maybe it's
> my interpretation of that or the scope of how you think the proposed
> feature would function, but I'd be very interested in hearing more about
> what you think.
> 
> Marc P. wrote:
> 
> > My point for discussing implementation outside of accumulo is because I
> > think it does invalidate a core tenet
> > 
> > On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com> wrote:
> > 
> > > Again, can we please bring this discussion back from discussions of
> > > implementations to security?
> > > 
> > > Does the fact that you three were discussing implementations imply that
> > > you do not think this invalidates one of the core tenets (security
> > > first) of Accumulo?
> > > 
> > > Christopher wrote:
> > > 
> > > > Keith, Russ, myself (and possibly others) were discussing this at the
> > > > hackathon after the Accumulo Summit, and I think our consensus was
> > > > basically this:
> > > > 
> > > > We need a generic pluggable mechanism for injecting arbitrary user
> > > > counters
> > > > into the RFiles. We can then use these counters in custom compaction
> > > > strategies, or other analysis. We can aggregate these counters at the
> > > > tablet, and table levels, and expose them in the API.
> > > > 
> > > > These counters could store information about visibility frequencies,
> > > > number
> > > > of delete entries, etc.
> > > > 
> > > > The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
> > > > In the discussion, there were lots of variations on the theme, though.
> > > > So,
> > > > the actual implementation could vary. But, having something like this
> > > > could
> > > > support a large number of use cases beyond just the histogram case.
> > > > 
> > > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
> > > > wrote:
> > > > 
> > > > > Trivially. We could do something more intelligent like also cache it in
> > > > > metadata (updating with compactions). Don't read too much into the
> > > > > implementation at this point; it was just the first idea I had about
> > > > > how we
> > > > > could do it :). I'm more concerned with the idea and its security
> > > > > implications right now.
> > > > > 
> > > > > In general, it seems like people are ok with it protected by a new
> > > > > permission role. Do you have more to add, Mike? Was your comment based
> > > > > on
> > > > > your interpretation of how Accumulo works or more a concern about
> > > > > implementing such a feature?
> > > > > 
> > > > > On Oct 11, 2016 21:29,<dl...@comcast.net> wrote:
> > > > > 
> > > > > > So, to get the set of visibilities used in a table, we would have to
> > > > > > open
> > > > > > all of the rfiles?
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> > > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
> > > > > > > To: Accumulo Dev List
> > > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> > > > > > > harmful?
> > > > > > > Interesting idea. It begs the question: should we allow any custom
> > > > > > > index at
> > > > > > > the RFile level? If RFile indexes were user-extensible, then a
> > > > > > > visibility index
> > > > > > > would be something any developer could write. That said, we can still
> > > > > > > include such an index as an example, and if we did it could be used by
> > > > > > > the
> > > > > > > Accumulo monitor.
> > > > > > > 
> > > > > > > The RFile-level sampling followed this path. I would support further
> > > > > > > work
> > > > > > > similar to it, though I admit I don't know how difficult a job it
> > > > > > > entails.
> > > > > > > Bonus points if the index information could be accessed from iterators
> > > > > > > the
> > > > > > > same way that sampled data can.
> > > > > > > 
> > > > > > > I can't speak to the appropriateness of visibility histograms on the
> > > > > > > monitor
> > > > > > > *by default*, but it would be a strictly useful feature if it could be
> > > > > > > enabled via
> > > > > > > a conf option.
> > > > > > > 
> > > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
> > > > > > > > he
> > > > > > > > mentioned was the lack of insight into the distribution of data
> > > > > > > > marked
> > > > > > > > with certain visibilities in a table. He presented an example similar
> > > > > > > > to this:
> > > > > > > > > Imagine a hypothetical system backed by Accumulo which stores medical
> > > > > > > > information. There are three labels in the system: PRIVATE,
> > > > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
> > > > > > > > be
> > > > > > > > considered to identify the individual. ANONYMIZED data is some
> > > > > > > > altered
> > > > > > > > version of the attribute that retains some portion of the original
> > > > > > > > value, but is missing enough context to not identify the individual
> > > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
> > > > > > > > > attributes which cannot identify the individual.
> > > > > > > > 
> > > > > > > > Doctors would be able to read the PRIVATE data, while researchers
> > > > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > > > > > > > question: how much of each kind of data is in the system? Without
> > > > > > > > knowing how much data is in the system, how can some application
> > > > > > > > developer (who does not have the ability to read all of the PRIVATE
> > > > > > > > > data) know that their application is returning a reasonably correct
> > > > > > > > amount of data? (there are many examples of questions which could be
> > > > > > > > > answered on this data alone)
> > > > > > > > 
> > > > > > > > Concretely, this histogram would look like (50 records with PRIVATE,
> > > > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> > > > > > > > 
> > > > > > > > PRIVATE: 50
> > > > > > > > ANONYMIZED: 50
> > > > > > > > PUBLIC: 20
> > > > > > > > 
> > > > > > > > Technically, I think this would actually be relatively simple to
> > > > > > > > implement. Inside of each RFile, we could maintain some histogram of
> > > > > > > > the visibilities observed in that file. This would allow us to very
> > > > > > > > easily report how much data in each table has each visibility label.
> > > > > > > > 
> > > > > > > > > However, would this feature be harmful to one of the core tenets of
> > > > > > > > Accumulo? Or, is acknowledging the existence of data in Accumulo with
> > > > > > > > a certain visibility acceptable? Would a new permission to use such
> > > > > > > > an
> > > > > > > > API to access this information be sufficient to protect the data?
> > > > > > > > 
> > > > > > > > *   Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Thanks, Marc. Follow-on question(s) for you:

Do you think _any_ such approach should never be pursued by Accumulo 
(reading into your other replies about doing it outside of Accumulo)? 
Are the permissions that we have in place not sufficient to protect such 
"metadata"?

Or, would such a feature be "OK" to you if it required some degree of 
additional manual steps by the administrator? (if so, what steps do you 
think make this acceptable)

In a similar vein, how do you see this broadening the scope of the 
Accumulo security model in an invalid manner? e.g. Administrators should 
never be able to see such information. Someone with sufficient access to 
a system would already be able to bypass Accumulo's security mechanisms. 
There are a number of vectors already where a sufficiently-credentialed 
individual could figure out this information (and more).

Ultimately, I see Accumulo's main security tenet as "users should never 
be allowed to see more data than they are authorized to see". Maybe it's 
my interpretation of that or the scope of how you think the proposed 
feature would function, but I'd be very interested in hearing more about 
what you think.

Marc P. wrote:
> My point for discussing implementation outside of accumulo is because I
> think it does invalidate a core tenet
>
> On Wed, Oct 12, 2016, 12:26 PM Josh Elser<jo...@gmail.com>  wrote:
>
>> Again, can we please bring this discussion back from discussions of
>> implementations to security?
>>
>> Does the fact that you three were discussing implementations imply that
>> you do not think this invalidates one of the core tenets (security
>> first) of Accumulo?
>>
>> Christopher wrote:
>>> Keith, Russ, myself (and possibly others) were discussing this at the
>>> hackathon after the Accumulo Summit, and I think our consensus was
>>> basically this:
>>>
>>> We need a generic pluggable mechanism for injecting arbitrary user
>> counters
>>> into the RFiles. We can then use these counters in custom compaction
>>> strategies, or other analysis. We can aggregate these counters at the
>>> tablet, and table levels, and expose them in the API.
>>>
>>> These counters could store information about visibility frequencies,
>> number
>>> of delete entries, etc.
>>>
>>> The interface might just be a Function<Entry<Key,Value>,Map<String,
>> Long>>.
>>> In the discussion, there were lots of variations on the theme, though.
>> So,
>>> the actual implementation could vary. But, having something like this
>> could
>>> support a large number of use cases beyond just the histogram case.
>>>
>>> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
>> wrote:
>>>> Trivially. We could do something more intelligent like also cache it in
>>>> metadata (updating with compactions). Don't read too much into the
>>>> implementation at this point; it was just the first idea I had about
>> how we
>>>> could do it :). I'm more concerned with the idea and its security
>>>> implications right now.
>>>>
>>>> In general, it seems like people are ok with it protected by a new
>>>> permission role. Do you have more to add, Mike? Was your comment based
>> on
>>>> your interpretation of how Accumulo works or more a concern about
>>>> implementing such a feature?
>>>>
>>>> On Oct 11, 2016 21:29,<dl...@comcast.net>   wrote:
>>>>
>>>>> So, to get the set of visibilities used in a table, we would have to
>> open
>>>>> all of the rfiles?
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>>>> To: Accumulo Dev List
>>>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>>>> harmful?
>>>>>> Interesting idea.  It begs the question: should we allow any custom
>>>>> index at
>>>>>> the RFile level?  If RFile indexes were user-extensible, then a
>>>>> visibility index
>>>>>> would be something any developer could write.  That said, we can still
>>>>>> include such an index as an example, and if we did it could be used by
>>>>> the
>>>>>> Accumulo monitor.
>>>>>>
>>>>>> The RFile-level sampling followed this path.  I would support further
>>>>> work
>>>>>> similar to it, though I admit I don't know how difficult a job it
>>>>> entails.
>>>>>> Bonus points if the index information could be accessed from iterators
>>>>> the
>>>>>> same way that sampled data can.
>>>>>>
>>>>>> I can't speak to the appropriateness of visibility histograms on the
>>>>> monitor
>>>>>> *by default*, but it would be a strictly useful feature if it could be
>>>>> enabled via
>>>>>> a conf option.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>>>> wrote:
>>>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>>>> he
>>>>>>> mentioned was the lack of insight into the distribution of data
>>>> marked
>>>>>>> with certain visibilities in a table. He presented an example similar
>>>>> to this:
>>>>>>> Image a hypothetical system backed by Accumulo which stores medical
>>>>>>> information. There are three labels in the system: PRIVATE,
>>>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>>>> be
>>>>>>> considered to identify the individual. ANONYMIZED data is some
>>>> altered
>>>>>>> version of the attribute that retains some portion of the original
>>>>>>> value, but is missing enough context to not identify the individual
>>>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>>>> attributes which are cannot identify the individual.
>>>>>>>
>>>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>>>> question: how much of each kind of data is in the system? Without
>>>>>>> knowing how much data is in the system, how can some application
>>>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>>>> data) know that their application is returning an reasonably correct
>>>>>>> amount of data? (there are many examples of questions which could be
>>>>>>> answer on this data alone)
>>>>>>>
>>>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>>>
>>>>>>> ```
>>>>>>> PRIVATE: 50
>>>>>>> ANONYMIZED: 50
>>>>>>> PUBLIC: 20
>>>>>>> ```
>>>>>>>
>>>>>>> Technically, I think this would actually be relatively simple to
>>>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>>>> the visibilities observed in that file. This would allow us to very
>>>>>>> easily report how much data in each table has each visibility label.
>>>>>>>
>>>>>>> However, would this feature be harmful to one of the core tenants of
>>>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>>>> a certain visibility acceptable? Would a new permission to use such
>>>> an
>>>>>>> API to access this information be sufficient to protect the data?
>>>>>>>
>>>>>>> - Josh
>>>>>>>
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by "Marc P." <ma...@gmail.com>.
My point in discussing an implementation outside of Accumulo is that I
think it does invalidate a core tenet.

On Wed, Oct 12, 2016, 12:26 PM Josh Elser <jo...@gmail.com> wrote:

> Again, can we please bring this discussion back from discussions of
> implementations to security?
>
> Does the fact that you three were discussing implementations imply that
> you do not think this invalidates one of the core tenets (security
> first) of Accumulo?
>
> Christopher wrote:
> > Keith, Russ, myself (and possible others) were discussing this at the
> > hackathon after the Accumulo Summit, and I think our consensus were
> > basically this:
> >
> > We need a generic pluggable mechanism for injecting arbitrary user
> counters
> > into the RFiles. We can then use these counters in custom compaction
> > strategies, or other analysis. We can aggregate these counters at the
> > tablet, and table levels, and expose them in the API.
> >
> > These counters could store information about visibility frequencies,
> number
> > of delete entries, etc.
> >
> > The interface might just be a Function<Entry<Key,Value>,Map<String,
> Long>>.
> >
> > In the discussion, there were lots of variations on the theme, though.
> So,
> > the actual implementation could vary. But, having something like this
> could
> > support a large number of use cases beyond just the histogram case.
> >
> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>
> wrote:
> >
> >> Trivially. We could do something more intelligent like also cache it in
> >> metadata (updating with compactions). Don't read too much into the
> >> implementation at this point; it was just the first idea I had about
> how we
> >> could do it :). I'm more concerned with the idea and its security
> >> implications right now.
> >>
> >> In general, it seems like people are ok with it protected by a new
> >> permission role. Do you have more to add, Mike? Was your comment based
> on
> >> your interpretation of how Accumulo works or more a concern about
> >> implementing such a feature?
> >>
> >> On Oct 11, 2016 21:29,<dl...@comcast.net>  wrote:
> >>
> >>> So, to get the set of visibilities used in a table, we would have to
> open
> >>> all of the rfiles?
> >>>
> >>>> -----Original Message-----
> >>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> >>>> Sent: Tuesday, October 11, 2016 3:43 PM
> >>>> To: Accumulo Dev List
> >>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> >>> harmful?
> >>>> Interesting idea.  It begs the question: should we allow any custom
> >>> index at
> >>>> the RFile level?  If RFile indexes were user-extensible, then a
> >>> visibility index
> >>>> would be something any developer could write.  That said, we can still
> >>>> include such an index as an example, and if we did it could be used by
> >>> the
> >>>> Accumulo monitor.
> >>>>
> >>>> The RFile-level sampling followed this path.  I would support further
> >>> work
> >>>> similar to it, though I admit I don't know how difficult a job it
> >>> entails.
> >>>> Bonus points if the index information could be accessed from iterators
> >>> the
> >>>> same way that sampled data can.
> >>>>
> >>>> I can't speak to the appropriateness of visibility histograms on the
> >>> monitor
> >>>> *by default*, but it would be a strictly useful feature if it could be
> >>> enabled via
> >>>> a conf option.
> >>>>
> >>>>
> >>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
> >>> wrote:
> >>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
> >> he
> >>>>> mentioned was the lack of insight into the distribution of data
> >> marked
> >>>>> with certain visibilities in a table. He presented an example similar
> >>> to this:
> >>>>> Image a hypothetical system backed by Accumulo which stores medical
> >>>>> information. There are three labels in the system: PRIVATE,
> >>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
> >> be
> >>>>> considered to identify the individual. ANONYMIZED data is some
> >> altered
> >>>>> version of the attribute that retains some portion of the original
> >>>>> value, but is missing enough context to not identify the individual
> >>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
> >>>>> attributes which are cannot identify the individual.
> >>>>>
> >>>>> Doctors would be able to read the PRIVATE data, while researchers
> >>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
> >>>>> question: how much of each kind of data is in the system? Without
> >>>>> knowing how much data is in the system, how can some application
> >>>>> developer (who does not have the ability to read all of the PRIVATE
> >>>>> data) know that their application is returning an reasonably correct
> >>>>> amount of data? (there are many examples of questions which could be
> >>>>> answer on this data alone)
> >>>>>
> >>>>> Concretely, this histogram would look like (50 records with PRIVATE,
> >>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >>>>>
> >>>>> ```
> >>>>> PRIVATE: 50
> >>>>> ANONYMIZED: 50
> >>>>> PUBLIC: 20
> >>>>> ```
> >>>>>
> >>>>> Technically, I think this would actually be relatively simple to
> >>>>> implement. Inside of each RFile, we could maintain some histogram of
> >>>>> the visibilities observed in that file. This would allow us to very
> >>>>> easily report how much data in each table has each visibility label.
> >>>>>
> >>>>> However, would this feature be harmful to one of the core tenants of
> >>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
> >>>>> a certain visibility acceptable? Would a new permission to use such
> >> an
> >>>>> API to access this information be sufficient to protect the data?
> >>>>>
> >>>>> - Josh
> >>>>>
> >>>
> >
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Again, can we please bring this discussion back from discussions of 
implementations to security?

Does the fact that you three were discussing implementations imply that 
you do not think this invalidates one of the core tenets (security 
first) of Accumulo?

Christopher wrote:
> Keith, Russ, myself (and possible others) were discussing this at the
> hackathon after the Accumulo Summit, and I think our consensus were
> basically this:
>
> We need a generic pluggable mechanism for injecting arbitrary user counters
> into the RFiles. We can then use these counters in custom compaction
> strategies, or other analysis. We can aggregate these counters at the
> tablet, and table levels, and expose them in the API.
>
> These counters could store information about visibility frequencies, number
> of delete entries, etc.
>
> The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
>
> In the discussion, there were lots of variations on the theme, though. So,
> the actual implementation could vary. But, having something like this could
> support a large number of use cases beyond just the histogram case.
>
> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<jo...@gmail.com>  wrote:
>
>> Trivially. We could do something more intelligent like also cache it in
>> metadata (updating with compactions). Don't read too much into the
>> implementation at this point; it was just the first idea I had about how we
>> could do it :). I'm more concerned with the idea and its security
>> implications right now.
>>
>> In general, it seems like people are ok with it protected by a new
>> permission role. Do you have more to add, Mike? Was your comment based on
>> your interpretation of how Accumulo works or more a concern about
>> implementing such a feature?
>>
>> On Oct 11, 2016 21:29,<dl...@comcast.net>  wrote:
>>
>>> So, to get the set of visibilities used in a table, we would have to open
>>> all of the rfiles?
>>>
>>>> -----Original Message-----
>>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>>> Sent: Tuesday, October 11, 2016 3:43 PM
>>>> To: Accumulo Dev List
>>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>>> harmful?
>>>> Interesting idea.  It begs the question: should we allow any custom
>>> index at
>>>> the RFile level?  If RFile indexes were user-extensible, then a
>>> visibility index
>>>> would be something any developer could write.  That said, we can still
>>>> include such an index as an example, and if we did it could be used by
>>> the
>>>> Accumulo monitor.
>>>>
>>>> The RFile-level sampling followed this path.  I would support further
>>> work
>>>> similar to it, though I admit I don't know how difficult a job it
>>> entails.
>>>> Bonus points if the index information could be accessed from iterators
>>> the
>>>> same way that sampled data can.
>>>>
>>>> I can't speak to the appropriateness of visibility histograms on the
>>> monitor
>>>> *by default*, but it would be a strictly useful feature if it could be
>>> enabled via
>>>> a conf option.
>>>>
>>>>
>>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<jo...@gmail.com>
>>> wrote:
>>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>> he
>>>>> mentioned was the lack of insight into the distribution of data
>> marked
>>>>> with certain visibilities in a table. He presented an example similar
>>> to this:
>>>>> Image a hypothetical system backed by Accumulo which stores medical
>>>>> information. There are three labels in the system: PRIVATE,
>>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>> be
>>>>> considered to identify the individual. ANONYMIZED data is some
>> altered
>>>>> version of the attribute that retains some portion of the original
>>>>> value, but is missing enough context to not identify the individual
>>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
>>>>> attributes which are cannot identify the individual.
>>>>>
>>>>> Doctors would be able to read the PRIVATE data, while researchers
>>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
>>>>> question: how much of each kind of data is in the system? Without
>>>>> knowing how much data is in the system, how can some application
>>>>> developer (who does not have the ability to read all of the PRIVATE
>>>>> data) know that their application is returning an reasonably correct
>>>>> amount of data? (there are many examples of questions which could be
>>>>> answer on this data alone)
>>>>>
>>>>> Concretely, this histogram would look like (50 records with PRIVATE,
>>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>>>>>
>>>>> ```
>>>>> PRIVATE: 50
>>>>> ANONYMIZED: 50
>>>>> PUBLIC: 20
>>>>> ```
>>>>>
>>>>> Technically, I think this would actually be relatively simple to
>>>>> implement. Inside of each RFile, we could maintain some histogram of
>>>>> the visibilities observed in that file. This would allow us to very
>>>>> easily report how much data in each table has each visibility label.
>>>>>
>>>>> However, would this feature be harmful to one of the core tenants of
>>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
>>>>> a certain visibility acceptable? Would a new permission to use such
>> an
>>>>> API to access this information be sufficient to protect the data?
>>>>>
>>>>> - Josh
>>>>>
>>>
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Oct 12, 2016 at 10:40 AM, ivan bella <iv...@ivan.bella.name> wrote:
> Yes, the "owners" could create a visibility counting mechanism separately; however, if we make this RFile metadata a part of the system, then we increase the "ease of use".  Unfortunately, system designers rarely think about the metadata they need from their system up front. That being said, if the performance impact of this is significant, then it needs to be made optional or we leave it as is.

We started discussing a generalized counting mechanism, as Christopher
mentioned.  The counts would be generated every time a file is
compacted, so that would help with the issue of having to decide up
front.  It would have to open all RFiles in the range.  This
information could be cached in the tserver index cache.
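
To make the shape of this concrete, here is a minimal sketch, not
anything that exists in Accumulo today.  Plain strings stand in for Key
and Value, a List of entries stands in for one RFile, and the names
CounterSketch, VISIBILITY_COUNTER, countFile and aggregate are all made
up for illustration.  The counter uses the Function signature
Christopher suggested.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch only: none of these names are Accumulo APIs.
class CounterSketch {

  // One entry in, named counter increments out. In this stand-in model the
  // entry's key is simply the visibility label of the record.
  static final Function<Entry<String,String>,Map<String,Long>> VISIBILITY_COUNTER =
      e -> Map.of(e.getKey(), 1L);

  // What would run when a file is written by a compaction: fold the counter
  // function over every entry and keep the totals with the file.
  static Map<String,Long> countFile(List<Entry<String,String>> file,
      Function<Entry<String,String>,Map<String,Long>> counter) {
    Map<String,Long> counts = new TreeMap<>();
    for (Entry<String,String> e : file) {
      counter.apply(e).forEach((name, n) -> counts.merge(name, n, Long::sum));
    }
    return counts;
  }

  // What would run at request time: sum the stored per-file counts for every
  // file in the requested range. No entries are re-read here.
  static Map<String,Long> aggregate(List<Map<String,Long>> perFile) {
    Map<String,Long> total = new TreeMap<>();
    for (Map<String,Long> counts : perFile) {
      counts.forEach((name, n) -> total.merge(name, n, Long::sum));
    }
    return total;
  }
}
```

Since summing the per-file maps is associative, the same aggregate step
could roll file totals up to a tablet and tablet totals up to a table.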

Below are some of the things discussed.

 * The class (and its config) used to generate the counts is stored in
the RFile.
 * When a user requests counts, they must specify the class and config
they expect the counts to have been generated with.
 * Must decide behaviour when a tablet has RFiles with counts
generated in different ways.  Could error or return partial results.
 * Must decide behaviour when there are too many counters.  Could cap
the number of counters.  When an RFile has capped counters, could
either error or return partial results.
 * If partial results are returned, then the API must provide a way to
indicate this to the user.
 * Need to decide if counters should be maintained for data in memory.
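
A rough sketch of how a few of these points could fit together, under
the assumption that we return partial results rather than error.
Everything here (CountRequestSketch, FileCounts, Result, capCounts,
request) is a hypothetical name for illustration, not an existing or
proposed Accumulo API, and counters for in-memory data are left out.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: none of these names are Accumulo APIs.
class CountRequestSketch {

  // What one RFile would store next to its data.
  static final class FileCounts {
    final String counterClass; // class used to generate the counts
    final String config;       // configuration that class was given
    final Map<String,Long> counts;
    final boolean capped;      // true if some counters were dropped

    FileCounts(String counterClass, String config, Map<String,Long> counts, boolean capped) {
      this.counterClass = counterClass;
      this.config = config;
      this.counts = counts;
      this.capped = capped;
    }
  }

  // What the API would hand back, including the partial-results indicator.
  static final class Result {
    final Map<String,Long> counts;
    final boolean partial;

    Result(Map<String,Long> counts, boolean partial) {
      this.counts = counts;
      this.partial = partial;
    }
  }

  // Writer side: keep at most maxCounters distinct counters (in sorted name
  // order, an arbitrary choice) and remember whether any were dropped.
  static FileCounts capCounts(String cls, String cfg, Map<String,Long> raw, int maxCounters) {
    Map<String,Long> kept = new TreeMap<>();
    boolean capped = false;
    for (Map.Entry<String,Long> e : new TreeMap<>(raw).entrySet()) {
      if (kept.size() < maxCounters) {
        kept.put(e.getKey(), e.getValue());
      } else {
        capped = true;
      }
    }
    return new FileCounts(cls, cfg, kept, capped);
  }

  // Reader side: the caller names the class and config it expects. Files
  // generated differently are skipped, and skipped or capped files mark the
  // result as partial. Throwing instead would be the "error" alternative.
  static Result request(String cls, String cfg, List<FileCounts> files) {
    Map<String,Long> total = new TreeMap<>();
    boolean partial = false;
    for (FileCounts f : files) {
      if (!f.counterClass.equals(cls) || !f.config.equals(cfg)) {
        partial = true;
        continue;
      }
      partial |= f.capped;
      f.counts.forEach((name, n) -> total.merge(name, n, Long::sum));
    }
    return new Result(total, partial);
  }
}
```

With this shape the caller always gets whatever counts were usable and
has to check the partial flag before trusting them as complete.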

>
>> On October 12, 2016 at 7:12 AM "Marc P." <ma...@gmail.com> wrote:
>>
>> What prevents the owners of the system from doing this in their own table?
>> Keeping track of that information is a use case of Accumulo. I think this
>> may be an example of external code that the user must install. Placing the
>> onus on the consumer mitigates concern that Mike "Mike" Drob and others may
>> have .
>>
>> A new role wouldn't be needed if permissions were placed on the
>> user/table/namespace that stored this information, correct?
>>
>> On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ct...@apache.org> wrote:
>>
>> > Keith, Russ, myself (and possible others) were discussing this at the
>> > hackathon after the Accumulo Summit, and I think our consensus were
>> > basically this:
>> >
>> > We need a generic pluggable mechanism for injecting arbitrary user counters
>> > into the RFiles. We can then use these counters in custom compaction
>> > strategies, or other analysis. We can aggregate these counters at the
>> > tablet, and table levels, and expose them in the API.
>> >
>> > These counters could store information about visibility frequencies, number
>> > of delete entries, etc.
>> >
>> > The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
>> >
>> > In the discussion, there were lots of variations on the theme, though. So,
>> > the actual implementation could vary. But, having something like this could
>> > support a large number of use cases beyond just the histogram case.
>> >
>> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <jo...@gmail.com> wrote:
>> >
>> > > Trivially. We could do something more intelligent like also cache it in
>> > > metadata (updating with compactions). Don't read too much into the
>> > > implementation at this point; it was just the first idea I had about how
>> > > we
>> > > could do it :). I'm more concerned with the idea and its security
>> > > implications right now.
>> > >
>> > > In general, it seems like people are ok with it protected by a new
>> > > permission role. Do you have more to add, Mike? Was your comment based on
>> > > your interpretation of how Accumulo works or more a concern about
>> > > implementing such a feature?
>> > >
>> > > On Oct 11, 2016 21:29, <dl...@comcast.net> wrote:
>> > >
>> > > > So, to get the set of visibilities used in a table, we would have to
>> > > > open
>> > > > all of the rfiles?
>> > > >
>> > > > > -----Original Message-----
>> > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> > > > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > > > To: Accumulo Dev List
>> > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
>> > > > > harmful?
>> > > > >
>> > > > > Interesting idea. It begs the question: should we allow any custom
>> > > > > index at
>> > > > > the RFile level? If RFile indexes were user-extensible, then a
>> > > > > visibility index
>> > > > > would be something any developer could write. That said, we can
>> > > > > still
>> > > > > include such an index as an example, and if we did it could be used
>> > > > > by
>> > > > > the
>> > > > > Accumulo monitor.
>> > > > >
>> > > > > The RFile-level sampling followed this path. I would support further
>> > > > > work
>> > > > > similar to it, though I admit I don't know how difficult a job it
>> > > > > entails.
>> > > > > Bonus points if the index information could be accessed from
>> > > > > iterators
>> > > > > the
>> > > > > same way that sampled data can.
>> > > > >
>> > > > > I can't speak to the appropriateness of visibility histograms on the
>> > > > > monitor
>> > > > > *by default*, but it would be a strictly useful feature if it could
>> > > > > be
>> > > > > enabled via
>> > > > > a conf option.
>> > > > >
>> > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <jo...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>> > > > > > he
>> > > > > > mentioned was the lack of insight into the distribution of data
>> > > > > > marked
>> > > > > > with certain visibilities in a table. He presented an example
>> > > > > > similar
>> > > > > > to this:
>> > > > > >
>> > > > > > Image a hypothetical system backed by Accumulo which stores medical
>> > > > > > information. There are three labels in the system: PRIVATE,
>> > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>> > > > > > be
>> > > > > > considered to identify the individual. ANONYMIZED data is some
>> > > > > > altered
>> > > > > > version of the attribute that retains some portion of the original
>> > > > > > value, but is missing enough context to not identify the individual
>> > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is
>> > > > > > for
>> > > > > > attributes which are cannot identify the individual.
>> > > > > >
>> > > > > > Doctors would be able to read the PRIVATE data, while researchers
>> > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
>> > > > > > question: how much of each kind of data is in the system? Without
>> > > > > > knowing how much data is in the system, how can some application
>> > > > > > developer (who does not have the ability to read all of the PRIVATE
>> > > > > > data) know that their application is returning an reasonably
>> > > > > > correct
>> > > > > > amount of data? (there are many examples of questions which could
>> > > > > > be
>> > > > > > answer on this data alone)
>> > > > > >
>> > > > > > Concretely, this histogram would look like (50 records with
>> > > > > > PRIVATE,
>> > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>> > > > > >
>> > > > > > PRIVATE: 50
>> > > > > > ANONYMIZED: 50
>> > > > > > PUBLIC: 20
>> > > > > >
>> > > > > > Technically, I think this would actually be relatively simple to
>> > > > > > implement. Inside of each RFile, we could maintain some histogram
>> > > > > > of
>> > > > > > the visibilities observed in that file. This would allow us to very
>> > > > > > easily report how much data in each table has each visibility
>> > > > > > label.
>> > > > > >
>> > > > > > However, would this feature be harmful to one of the core tenants
>> > > > > > of
>> > > > > > Accumulo? Or, is acknowledging the existence of data in Accumulo
>> > > > > > with
>> > > > > > a certain visibility acceptable? Would a new permission to use such
>> > > > > > an
>> > > > > > API to access this information be sufficient to protect the data?
>> > > > > >
>> > > > > > > - Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by ivan bella <iv...@ivan.bella.name>.
Yes, the "owners" could create a visibility counting mechanism separately; however, if we make this RFile metadata a part of the system, then we increase the "ease of use".  Unfortunately, system designers rarely think about the metadata they need from their system up front. That being said, if the performance impact of this is significant, then it needs to be made optional or we leave it as is.

> On October 12, 2016 at 7:12 AM "Marc P." <ma...@gmail.com> wrote:
> 
> What prevents the owners of the system from doing this in their own table?
> Keeping track of that information is a use case of Accumulo. I think this
> may be an example of external code that the user must install. Placing the
> onus on the consumer mitigates concern that Mike "Mike" Drob and others may
> have .
> 
> A new role wouldn't be needed if permissions were placed on the
> user/table/namespace that stored this information, correct?
> 
> On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ct...@apache.org> wrote:
> 
> > Keith, Russ, myself (and possible others) were discussing this at the
> > hackathon after the Accumulo Summit, and I think our consensus were
> > basically this:
> > 
> > We need a generic pluggable mechanism for injecting arbitrary user counters
> > into the RFiles. We can then use these counters in custom compaction
> > strategies, or other analysis. We can aggregate these counters at the
> > tablet, and table levels, and expose them in the API.
> > 
> > These counters could store information about visibility frequencies, number
> > of delete entries, etc.
> > 
> > The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
> > 
> > In the discussion, there were lots of variations on the theme, though. So,
> > the actual implementation could vary. But, having something like this could
> > support a large number of use cases beyond just the histogram case.
> > 
> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <jo...@gmail.com> wrote:
> > 
> > > Trivially. We could do something more intelligent like also cache it in
> > > metadata (updating with compactions). Don't read too much into the
> > > implementation at this point; it was just the first idea I had about how
> > > we
> > > could do it :). I'm more concerned with the idea and its security
> > > implications right now.
> > > 
> > > In general, it seems like people are ok with it protected by a new
> > > permission role. Do you have more to add, Mike? Was your comment based on
> > > your interpretation of how Accumulo works or more a concern about
> > > implementing such a feature?
> > > 
> > > On Oct 11, 2016 21:29, <dl...@comcast.net> wrote:
> > > 
> > > > So, to get the set of visibilities used in a table, we would have to
> > > > open
> > > > all of the rfiles?
> > > > 
> > > > > -----Original Message-----
> > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> > > > > Sent: Tuesday, October 11, 2016 3:43 PM
> > > > > To: Accumulo Dev List
> > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> > > > > harmful?
> > > > > 
> > > > > Interesting idea. It begs the question: should we allow any custom
> > > > > index at
> > > > > the RFile level? If RFile indexes were user-extensible, then a
> > > > > visibility index
> > > > > would be something any developer could write. That said, we can
> > > > > still
> > > > > include such an index as an example, and if we did it could be used
> > > > > by
> > > > > the
> > > > > Accumulo monitor.
> > > > > 
> > > > > The RFile-level sampling followed this path. I would support further
> > > > > work
> > > > > similar to it, though I admit I don't know how difficult a job it
> > > > > entails.
> > > > > Bonus points if the index information could be accessed from
> > > > > iterators
> > > > > the
> > > > > same way that sampled data can.
> > > > > 
> > > > > I can't speak to the appropriateness of visibility histograms on the
> > > > > monitor
> > > > > *by default*, but it would be a strictly useful feature if it could
> > > > > be
> > > > > enabled via
> > > > > a conf option.
> > > > > 
> > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <jo...@gmail.com>
> > > > > wrote:
> > > > > 
> > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
> > > > > > he
> > > > > > mentioned was the lack of insight into the distribution of data
> > > > > > marked
> > > > > > with certain visibilities in a table. He presented an example
> > > > > > similar
> > > > > > to this:
> > > > > > 
> > > > > > Image a hypothetical system backed by Accumulo which stores medical
> > > > > > information. There are three labels in the system: PRIVATE,
> > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
> > > > > > be
> > > > > > considered to identify the individual. ANONYMIZED data is some
> > > > > > altered
> > > > > > version of the attribute that retains some portion of the original
> > > > > > value, but is missing enough context to not identify the individual
> > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is
> > > > > > for
> > > > > > attributes which are cannot identify the individual.
> > > > > > 
> > > > > > Doctors would be able to read the PRIVATE data, while researchers
> > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > > > > > question: how much of each kind of data is in the system? Without
> > > > > > knowing how much data is in the system, how can some application
> > > > > > developer (who does not have the ability to read all of the PRIVATE
> > > > > > data) know that their application is returning an reasonably
> > > > > > correct
> > > > > > amount of data? (there are many examples of questions which could
> > > > > > be
> > > > > > answer on this data alone)
> > > > > > 
> > > > > > Concretely, this histogram would look like (50 records with
> > > > > > PRIVATE,
> > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> > > > > > 
> > > > > > PRIVATE: 50
> > > > > > ANONYMIZED: 50
> > > > > > PUBLIC: 20
> > > > > > 
> > > > > > Technically, I think this would actually be relatively simple to
> > > > > > implement. Inside of each RFile, we could maintain some histogram
> > > > > > of
> > > > > > the visibilities observed in that file. This would allow us to very
> > > > > > easily report how much data in each table has each visibility
> > > > > > label.
> > > > > > 
> > > > > > However, would this feature be harmful to one of the core tenants
> > > > > > of
> > > > > > Accumulo? Or, is acknowledging the existence of data in Accumulo
> > > > > > with
> > > > > > a certain visibility acceptable? Would a new permission to use such
> > > > > > an
> > > > > > API to access this information be sufficient to protect the data?
> > > > > > 
> > > > > > > - Josh

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by "Marc P." <ma...@gmail.com>.
Beyond adding a tool on the side. It doesn't fit in metadata as that
requires aggregated reads vs table aggregates data.

On Wed, Oct 12, 2016, 11:02 AM Marc P. <ma...@gmail.com> wrote:

> How does it increase ease of use?
>
> On Wed, Oct 12, 2016, 10:34 AM ivan bella <iv...@ivan.bella.name> wrote:
>
> Yes, the "owners" could create a visibility counting mechanism separately;
> however, if we make this RFile metadata a part of the system, then we
> increase the "ease of use". Unfortunately, system designers rarely think
> about the metadata they need from their system up front. That being said,
> if the performance impact of this is significant, then it needs to be made
> optional or we leave it as is.
>
>
> > On October 12, 2016 at 7:12 AM "Marc P." <ma...@gmail.com> wrote:
> >
> >
> > What prevents the owners of the system from doing this in their own
> table?
> > Keeping track of that information is a use case of Accumulo. I think this
> > may be an example of external code that the user must install. Placing
> the
> > onus on the consumer mitigates concern that Mike "Mike" Drob and others
> may
> > have .
> >
> > A new role wouldn't be needed if permissions were placed on the
> > user/table/namespace that stored this information, correct?
> >
> > On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ct...@apache.org>
> wrote:
> >
> > > Keith, Russ, myself (and possible others) were discussing this at the
> > > hackathon after the Accumulo Summit, and I think our consensus were
> > > basically this:
> > >
> > > We need a generic pluggable mechanism for injecting arbitrary user
> counters
> > > into the RFiles. We can then use these counters in custom compaction
> > > strategies, or other analysis. We can aggregate these counters at the
> > > tablet, and table levels, and expose them in the API.
> > >
> > > These counters could store information about visibility frequencies,
> number
> > > of delete entries, etc.
> > >
> > > The interface might just be a Function<Entry<Key,Value>,Map<String,
> > > Long>>.
> > >
> > > In the discussion, there were lots of variations on the theme, though.
> So,
> > > the actual implementation could vary. But, having something like this
> could
> > > support a large number of use cases beyond just the histogram case.
> > >
> > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <jo...@gmail.com>
> wrote:
> > >
> > > > Trivially. We could do something more intelligent like also cache it
> in
> > > > metadata (updating with compactions). Don't read too much into the
> > > > implementation at this point; it was just the first idea I had about
> how
> > > we
> > > > could do it :). I'm more concerned with the idea and its security
> > > > implications right now.
> > > >
> > > > In general, it seems like people are ok with it protected by a new
> > > > permission role. Do you have more to add, Mike? Was your comment
> based on
> > > > your interpretation of how Accumulo works or more a concern about
> > > > implementing such a feature?
> > > >
> > > > On Oct 11, 2016 21:29, <dl...@comcast.net> wrote:
> > > >
> > > > > So, to get the set of visibilities used in a table, we would have
> to
> > > open
> > > > > all of the rfiles?
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
> > > > > > To: Accumulo Dev List
> > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> > > > > harmful?
> > > > > >
> > > > > > Interesting idea. It begs the question: should we allow any
> custom
> > > > > index at
> > > > > > the RFile level? If RFile indexes were user-extensible, then a
> > > > > visibility index
> > > > > > would be something any developer could write. That said, we can
> > > still
> > > > > > include such an index as an example, and if we did it could be
> used
> > > by
> > > > > the
> > > > > > Accumulo monitor.
> > > > > >
> > > > > > The RFile-level sampling followed this path. I would support
> further
> > > > > work
> > > > > > similar to it, though I admit I don't know how difficult a job it
> > > > > entails.
> > > > > > Bonus points if the index information could be accessed from
> > > iterators
> > > > > the
> > > > > > same way that sampled data can.
> > > > > >
> > > > > > I can't speak to the appropriateness of visibility histograms on
> the
> > > > > monitor
> > > > > > *by default*, but it would be a strictly useful feature if it
> could
> > > be
> > > > > enabled via
> > > > > > a conf option.
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <
> josh.elser@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One
> topic
> > > > he
> > > > > > > mentioned was the lack of insight into the distribution of data
> > > > marked
> > > > > > > with certain visibilities in a table. He presented an example
> > > similar
> > > > > to this:
> > > > > > >
> > > > > > > Image a hypothetical system backed by Accumulo which stores
> medical
> > > > > > > information. There are three labels in the system: PRIVATE,
> > > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could
> reasonably
> > > > be
> > > > > > > considered to identify the individual. ANONYMIZED data is some
> > > > altered
> > > > > > > version of the attribute that retains some portion of the
> original
> > > > > > > value, but is missing enough context to not identify the
> individual
> > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data
> is
> > > for
> > > > > > > attributes which are cannot identify the individual.
> > > > > > >
> > > > > > > Doctors would be able to read the PRIVATE data, while
> researchers
> > > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > > > > > > question: how much of each kind of data is in the system?
> Without
> > > > > > > knowing how much data is in the system, how can some
> application
> > > > > > > developer (who does not have the ability to read all of the
> PRIVATE
> > > > > > > data) know that their application is returning an reasonably
> > > correct
> > > > > > > amount of data? (there are many examples of questions which
> could
> > > be
> > > > > > > answer on this data alone)
> > > > > > >
> > > > > > > Concretely, this histogram would look like (50 records with
> > > PRIVATE,
> > > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> > > > > > >
> > > > > > > ```
> > > > > > > PRIVATE: 50
> > > > > > > ANONYMIZED: 50
> > > > > > > PUBLIC: 20
> > > > > > > ```
> > > > > > >
> > > > > > > Technically, I think this would actually be relatively simple
> to
> > > > > > > implement. Inside of each RFile, we could maintain some
> histogram
> > > of
> > > > > > > the visibilities observed in that file. This would allow us to
> very
> > > > > > > easily report how much data in each table has each visibility
> > > label.
> > > > > > >
> > > > > > > However, would this feature be harmful to one of the core
> tenants
> > > of
> > > > > > > Accumulo? Or, is acknowledging the existence of data in
> Accumulo
> > > with
> > > > > > > a certain visibility acceptable? Would a new permission to use
> such
> > > > an
> > > > > > > API to access this information be sufficient to protect the
> data?
> > > > > > >
> > > > > > > - Josh
> > > > > > >
> > > > >
> > > > >
> > > >
> > >
>
>

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by "Marc P." <ma...@gmail.com>.
How does it increase ease of use?

On Wed, Oct 12, 2016, 10:34 AM ivan bella <iv...@ivan.bella.name> wrote:

> Yes, the "owners" could create a visibility counting mechanism separately;
> however, if we make this RFile metadata a part of the system then we
> increase the "ease of use". Unfortunately, system designers rarely think
> about the metadata they need from their system up front. That being said,
> if the performance impact of this is significant then it needs to be made
> optional or we leave it as is.
>
>
> > On October 12, 2016 at 7:12 AM "Marc P." <ma...@gmail.com> wrote:
> >
> >
> > What prevents the owners of the system from doing this in their own table?
> > Keeping track of that information is a use case of Accumulo. I think this
> > may be an example of external code that the user must install. Placing the
> > onus on the consumer mitigates the concerns that Mike "Mike" Drob and others
> > may have.
> >
> > A new role wouldn't be needed if permissions were placed on the
> > user/table/namespace that stored this information, correct?

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by "Marc P." <ma...@gmail.com>.
What prevents the owners of the system from doing this in their own table?
Keeping track of that information is a use case of Accumulo. I think this
may be an example of external code that the user must install. Placing the
onus on the consumer mitigates the concerns that Mike "Mike" Drob and others
may have.

A new role wouldn't be needed if permissions were placed on the
user/table/namespace that stored this information, correct?

On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ct...@apache.org> wrote:

> Keith, Russ, and I (and possibly others) were discussing this at the
> hackathon after the Accumulo Summit, and I think our consensus was
> basically this:
>
> We need a generic pluggable mechanism for injecting arbitrary user counters
> into the RFiles. We can then use these counters in custom compaction
> strategies, or other analysis. We can aggregate these counters at the
> tablet and table levels, and expose them in the API.
>
> These counters could store information about visibility frequencies, number
> of delete entries, etc.
>
> The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
>
> In the discussion, there were lots of variations on the theme, though. So,
> the actual implementation could vary. But, having something like this could
> support a large number of use cases beyond just the histogram case.

Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Christopher <ct...@apache.org>.
Keith, Russ, and I (and possibly others) were discussing this at the
hackathon after the Accumulo Summit, and I think our consensus was
basically this:

We need a generic pluggable mechanism for injecting arbitrary user counters
into the RFiles. We can then use these counters in custom compaction
strategies, or other analysis. We can aggregate these counters at the
tablet and table levels, and expose them in the API.

These counters could store information about visibility frequencies, number
of delete entries, etc.

The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.

In the discussion, there were lots of variations on the theme, though. So,
the actual implementation could vary. But, having something like this could
support a large number of use cases beyond just the histogram case.
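
To make the counter idea above a bit more concrete, here is a minimal sketch,
not a proposed API. SketchKey is a hypothetical stand-in for Accumulo's Key
that models only the row and the visibility label, values are plain strings,
and the names CounterSketch, VISIBILITY_COUNTER, and summarize are made up for
illustration. The function has the same shape as the interface mentioned above
(one entry in, a map of named counters out), and summarize() adds up whatever
the function emits.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

final class CounterSketch {
    // Hypothetical stand-in for Accumulo's Key; only row and visibility are modeled.
    static final class SketchKey {
        final String row;
        final String visibility;

        SketchKey(String row, String visibility) {
            this.row = row;
            this.visibility = visibility;
        }
    }

    // A pluggable counter: emits a count of 1 under the entry's visibility label.
    static final Function<Entry<SketchKey, String>, Map<String, Long>> VISIBILITY_COUNTER =
        e -> Map.of(e.getKey().visibility, 1L);

    // Applies the counter to every entry and sums the emitted counters by name.
    static Map<String, Long> summarize(
            Iterable<Entry<SketchKey, String>> entries,
            Function<Entry<SketchKey, String>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<SketchKey, String> e : entries) {
            counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Entry<SketchKey, String>> entries = List.of(
            Map.entry(new SketchKey("r1", "PRIVATE"), "Josh Elser"),
            Map.entry(new SketchKey("r1", "ANONYMIZED"), "J E"),
            Map.entry(new SketchKey("r2", "PRIVATE"), "another name"),
            Map.entry(new SketchKey("r2", "PUBLIC"), "some public attribute"));
        // prints {ANONYMIZED=1, PRIVATE=2, PUBLIC=1}
        System.out.println(summarize(entries, VISIBILITY_COUNTER));
    }
}
```

A different counter (say, one that counts delete entries) would plug into the
same summarize() call without any other change, which is the appeal of keeping
the interface this small.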

On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <jo...@gmail.com> wrote:

> Trivially. We could do something more intelligent like also cache it in
> metadata (updating with compactions). Don't read too much into the
> implementation at this point; it was just the first idea I had about how we
> could do it :). I'm more concerned with the idea and its security
> implications right now.
>
> In general, it seems like people are ok with it protected by a new
> permission role. Do you have more to add, Mike? Was your comment based on
> your interpretation of how Accumulo works or more a concern about
> implementing such a feature?

RE: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Josh Elser <jo...@gmail.com>.
Trivially. We could do something more intelligent like also cache it in
metadata (updating with compactions). Don't read too much into the
implementation at this point; it was just the first idea I had about how we
could do it :). I'm more concerned with the idea and its security
implications right now.
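
For illustration only, rolling per-file histograms up to a table-level one
could be as simple as summing maps. This is a sketch under assumptions: the
class and method names are hypothetical, and how the per-file maps would be
read out of the RFiles (or cached in metadata) is left open.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HistogramMergeSketch {
    // Sums per-file visibility counts into a single table-level histogram.
    static Map<String, Long> merge(Iterable<Map<String, Long>> perFileHistograms) {
        Map<String, Long> table = new TreeMap<>();
        for (Map<String, Long> file : perFileHistograms) {
            file.forEach((label, count) -> table.merge(label, count, Long::sum));
        }
        return table;
    }

    public static void main(String[] args) {
        // Two hypothetical files whose counts add up to the example in this thread.
        Map<String, Long> fileA = Map.of("PRIVATE", 30L, "ANONYMIZED", 50L);
        Map<String, Long> fileB = Map.of("PRIVATE", 20L, "PUBLIC", 20L);
        // prints {ANONYMIZED=50, PRIVATE=50, PUBLIC=20}
        System.out.println(merge(List.of(fileA, fileB)));
    }
}
```

One caveat with plain summing: counts taken per file may include entries a
scan would not return (for example, older versions or deleted entries that have
not been compacted away yet), so the result is probably best treated as an
approximation.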

In general, it seems like people are ok with it protected by a new
permission role. Do you have more to add, Mike? Was your comment based on
your interpretation of how Accumulo works or more a concern about
implementing such a feature?

On Oct 11, 2016 21:29, <dl...@comcast.net> wrote:

> So, to get the set of visibilities used in a table, we would have to open
> all of the rfiles?

RE: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by dl...@comcast.net.
So, to get the set of visibilities used in a table, we would have to open all of the rfiles?

> -----Original Message-----
> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> Sent: Tuesday, October 11, 2016 3:43 PM
> To: Accumulo Dev List
> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
> 
> Interesting idea.  It raises the question: should we allow any custom index at
> the RFile level?  If RFile indexes were user-extensible, then a visibility index
> would be something any developer could write.  That said, we can still
> include such an index as an example, and if we did it could be used by the
> Accumulo monitor.
> 
> The RFile-level sampling followed this path.  I would support further work
> similar to it, though I admit I don't know how difficult a job it entails.
> Bonus points if the index information could be accessed from iterators the
> same way that sampled data can.
> 
> I can't speak to the appropriateness of visibility histograms on the monitor
> *by default*, but it would be a strictly useful feature if it could be enabled via
> a conf option.


Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Posted by Dylan Hutchison <dh...@cs.washington.edu>.
Interesting idea.  It raises the question: should we allow any custom index
at the RFile level?  If RFile indexes were user-extensible, then a
visibility index would be something any developer could write.  That said,
we can still include such an index as an example, and if we did it could be
used by the Accumulo monitor.

The RFile-level sampling followed this path.  I would support further work
similar to it, though I admit I don't know how difficult a job it entails.
Bonus points if the index information could be accessed from iterators the
same way that sampled data can.

I can't speak to the appropriateness of visibility histograms on the
monitor *by default*, but it would be a strictly useful feature if it could
be enabled via a conf option.
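
To make the per-RFile histogram idea from the original post a bit more
concrete, here is a minimal sketch in plain Python. It does not use any
Accumulo API; file_histogram and table_histogram are hypothetical names, and
the two "files" are made-up data chosen to reproduce the 50/50/20 example.
It only illustrates that per-file counts can be merged into a table-level
view without re-reading the data.

```python
from collections import Counter


def file_histogram(visibilities):
    """Count entries per visibility label for one (hypothetical) file."""
    return Counter(visibilities)


def table_histogram(file_histograms):
    """Merge per-file histograms into a single table-level histogram."""
    total = Counter()
    for histogram in file_histograms:
        total.update(histogram)  # adds counts label by label
    return total


# Two made-up files whose combined contents match the example histogram.
file_a = file_histogram(["PRIVATE"] * 30 + ["ANONYMIZED"] * 50)
file_b = file_histogram(["PRIVATE"] * 20 + ["PUBLIC"] * 20)

table = table_histogram([file_a, file_b])
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {table[label]}")
```

Note that a naive sum like this would likely be approximate in practice,
since deletes and multiple versions of a key can be spread across files
until a compaction merges them.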

